 Research
 Open Access
 Published:
Estimating loop length from CryoEM images at medium resolutions
BMC Structural Biologyvolume 13, Article number: S5 (2013)
Abstract
Background
De novo protein modeling approaches utilize 3dimensional (3D) images derived from electron cryomicroscopy (CryoEM) experiments. The skeleton connecting two secondary structures such as αhelices represent the loop in the 3D image. The accuracy of the skeleton and of the detected secondary structures are critical in De novo modeling. It is important to measure the length along the skeleton accurately since the length can be used as a constraint in modeling the protein.
Results
We have developed a novel computational geometric approach to derive a simplified curve in order to estimate the loop length along the skeleton. The method was tested using fifty simulated density images of helixloophelix segments of atomic structures and eighteen experimentally derived density data from Electron Microscopy Data Bank (EMDB). The test using simulated density maps shows that it is possible to estimate within 0.5Å of the expected length for 48 of the 50 cases. The experiments, involving eighteen experimentally derived CryoEM images, show that twelve cases have error within 2Å.
Conclusions
The tests using both simulated and experimentally derived images show that it is possible for our proposed method to estimate the loop length along the skeleton if the secondary structure elements, such as αhelices, can be detected accurately, and there is a continuous skeleton linking the αhelices.
Background
Over the last ten years, electron cryomicroscopy (CryoEM) experiments yielded increasing numbers of 3D electron density images of protein molecules. The Electron Microscopy Data Bank (EMDB) currently archives the 3D images, referred to as density maps in this paper, with a wide range of resolutions from 3Å to over 80Å [1]. When the density map is resolved to high resolution (35Å) [2, 3], it is possible to derive the near atomic structure from the density map. However, when the density map is not resolved to the high resolution range, it is still challenging to derive the structure of the imaged molecule [4–6]. Fitting and comparative modeling approaches have been developed to utilize the existing atomic structures in the Protein Data Bank (PDB) [6, 7]. These approaches apply when a component of the target density map has been resolved to near atomic resolution structure or when the target protein shares significant homology with existing atomic structures.
Modeling protein molecules using de novo methods is a general approach to derive the atomic structure from medium resolution (510Å) electron density 3D images [6, 8–10]. Only the 3D image (top left in Figure 1) and amino acid sequence (top right of Figure 1) are used in de novo processes. It does not need an atomic template protein structure from PDB as required for fitting and comparative modeling methods. First, the secondary structure elements (SSEs) such as αhelices (red sticks in Figure 1) and βsheets are often identified using pattern recognition methods [11–16]. Skeletonization methods detect the medial axis (green, left in Figure 1) of a 3D image's isosurface [10, 17]. Next, the amino acid sequence segments (red cylinders, right of Figure 1) of the SSEs can be predicted using existing prediction tools [18–21]. Various approaches have been developed to combine the secondary structure information from the 3D image and 1D sequence in order to derive the topology. The atomic structures can be built once the possible topologies are predicted [6–8].
An amino acid sequence has a direction, starting with a nitrogen atom (Nterminal) and ending with the a carbon atom (Cterminal). The SSE topology is the order in which this sequence traverses the protein's helices and sheets, including the direction of entry into and exit from the secondary structure. The native topology of a protein's SSEs is likely to produce the lowest energy state compared to incorrect topologies [22]. Determining the correct topology is a crucial step in de novo modeling. We have formulated the SSE topology problem into a constrained graph matching problem and provided a dynamic programming algorithm [9]. We later used a dynamic graph approach to handle errors in the data [23].
The distance between two SSEs is an important constraint in graph matching. As an example, two helices closely located in a 3D image should be matched to two helices with similar distance estimated from the 1D sequence. The distance between two ends of two helices (one on each) can be simply estimated as the Euclidean distance [9], or can be measured more accurately along the skeleton [8, 23, 24]. From the amino acid sequence input, the distance between SSEs can be estimated assuming a 3.8Å distance between adjacent amino acids in the sequence. A scoring function can be developed to represent the overall matching between two sets of SSEs, one from the 3D image and the other from the 1D sequence. The correct topology is assumed to be the one with the best match score.
Despite the relative accuracy of skeletonization algorithms, overestimation may occur if length is measured directly along their piecewise linear curves, which contain many right angles and some error from the thinning process and the 3D image itself.
Here, we extend our previous work in [25], in which we obtained preliminary results testing a computationalgeometric method to measure the length of a simplified skeleton. In addition to expanding our test set to include synthetically generated density maps and additional experimentally derived data, we used the directed Hausdorff distance to handle segmentation issues. The measured length appears to agree with the expected length when the SSEs are detected fairly accurately.
Results and discussion
Test data and overall process
Two data sets were used in testing performance. The simulated data set consists of fifty randomly selected helixloophelix (HLH) motifs from atomic structures in PDB. The proteins extracted exhibit less than 10% sequence identity. Each extracted HLH of the protein structure was used to generate a 3D density map using EMAN1.9 pdb2mrc [26]. The density maps were simulated to 8Å resolution.
The real data set consists of 18 cases whose density maps were downloaded from EMDB with resolution from 4.2Å to 6.8Å. Their EMDB entries are 5030 (6.4Å), 1733 (6.8Å), 5001 (4.2Å), 1740 (6.8Å) and 5168 (6.6Å). Each of these density maps is aligned with their PDB structures at download and provided multiple helixloophelix motif samples for the experiment.
The length of a loop was measured along the skeleton voxel points between (and including) the end points of the two surrounding helices. An endpoint of a helix represents an end of the central axis of the helix [11, 12]. The helices were detected using SSETracer, a simplified version of SSELearner [16]. The skeleton was detected using a local maximum clustering method, more details of which are forthcoming in a separate paper. In order to test the accuracy of our algorithm, we visually inspected the detected helices and included only those cases in which the helices were roughly accurate. This was done to distinguish the potential error in our loop length estimation from that of helix detection, skeletonization, or production of the CryoEM image itself.
Accuracy
The accuracy of the measurement was evaluated using both the simulated data and the real data from the EMDB. Table 1 summarizes the results for the simulated data. The input to our method includes two pieces of information: the detected helix (red sticks) end points and the skeleton voxels (red dots) (Figure 2B). Each measured length along the skeleton was compared with the expected length of the loop. The expected length was calculated as 3.8Å ×(n + 1), where n is the number of the amino acids on the loop and 3.8Å is the average distance between two amino acids.
The fifty tested cases were sorted by the length of the loop, ranging from 1 to 10 amino acids. Almost all the 50 test cases appear to have the error within 0.5Å (column 6 of Table 1). As an example, the loop in 1DU0 (row 15 of Table 1) has three amino acids and the expected length of the loop is 15.2Å. The measured length of the loop along the skeleton is 14.99Å. The relative error is 1.4% of the expected loop length. The simplified curve (blue in Figure 2B) detected by the algorithm appears to be close to the skeleton points (red dots). Another example is from 1MW8 (Figure 2 C, D, row 29 of Table 1) with six amino acids on the loop. The error of the measurement is 0.358Å in this case (column 6 of row 29, Table 1). Note that the skeleton points branch into multiple directions (Figure 2D), yet the algorithm correctly measured the length between the two ending points of the helices by using Hausdorff measurements (see Algorithm). In some cases, as in rows 18 and 28 in Table 1 the greedy step in the Hausdorff computation breaks down and the wrong pair of endpoints was used or the wrong skeleton segment was measured.
The test using the experimentally derived density data involves eighteen HLH motifs from density maps with 47Å resolution from EMDB. Twelve of the eighteen cases have measured error within 2Å, and six have error between 2Å and 5Å. The real density maps from the experiments are often more challenging with missing density and additional densities that do not align with the true structure. The helices and skeletons detected from the real maps are often less accurate than those from the simulated density maps. Figure 3 shows an example of experimentally derived data in EMDB 5168 (row 15 in Table 2). The difference between the measured and the expected distance is 2.88Å, higher than a comparable case with a synthetic density map used instead. In general, we saw an increase in error using the real density images, due to greater errors in helix detection and skeletonization induced by the noise present.
The algorithm uses a simplification parameter ∈ that is user defined. ∈ is the width of the vertex removal band (refer to the algorithm for more details). In general, the smaller the ∈ value, the less change in the simplified curve compared to the initial path. In some cases, ∈ = 0 is the best option, leaving the original path unchanged. In other cases, a much larger value of ∈ was needed. In order to see the degree of simplification that produced the most accurate results, we sampled ∈'s range inside the interval 0[6] in increments of 0.05. The measured lengths w.r.t. ∈ values appear to form a step function, and the value closest to the expected value (Figure 4 left) was marked. As seen from this case, the measured length reduces as ∈ increases stepwise.
Figure 4 (right) shows the distribution of the values of ∈ for about 800 simulated cases that had less than 0.5Å difference. The vertical lines represent values of ∈ for cases in Table 1. It appears that most of the ∈ values between 0.0 and 1.5 minimize the error in the measurement (Figure 4, right). However, we observed that we need larger ∈ values for the experimentally derived data than for the simulated density maps. This difference is likely to be associated with the quality of skeletonization and helix detection. For the simulated cases, ∈ between 0.0 and 1.5 is more likely to produce a good estimate after sufficient preprocessing of the density maps. Multiple ∈ values might be needed to sample the expected length when working with the experimentally derived cryoEM data.
Conclusions
We have developed a new approach to estimate loop length along the skeleton from a CryoEM density map. Our tests, using both simulated and experimentally derived images at medium resolution, show that it is possible for our proposed method to estimate fairly accurately the loop length along the skeleton if the SSEs such as αhelices and the skeleton are detected fairly accurately.
Methods
The overall process to measure the loop length along the skeleton consists of two tasks: preprocessing and length calculation (Figure 5). The purpose of the preprocessing is to derive the skeleton and the endpoints of the two helices from the density map. Once such information is obtained, our algorithm uses graphs and computational geometric concepts to derive the simplified curve.
Preprocessing
Each case in Table 1 had a density map generated using the HLH segment of the PDB structure and EMAN's pdb2mrc [26]. We applied a skeletonization method that utilizes the local maximum points and clustering to derive the skeleton points from the density map. The HLH regions of cases in Table 2 were extracted from entire density images downloaded from EMDB. We used SSETracer, a secondary structure detection method to detect helices from the density map. It is modified from SSELearner [16] with improved speed. Since helix detection is independent of skeletonization, it is necessary to remove the skeleton voxels that belong to the helix region in order to obtain the skeleton belonging to the loop. We removed those skeleton voxels that are within 2.3Å of the central axis of the helix. Note that a helix is 2.3  2.5Å in radius [11, 27]. After such processing, the skeleton voxels that presumably belong to the loop are segmented from the rest of the skeleton voxels and are subject for length calculation.
Algorithm
Local connectivity graphs
A local connectivity graph (LCG) represents a cluster of skeleton voxels. We impose a constraint on the maximum allowable edge length in a graph, possibly yielding multiple disconnected graphs when all skeleton voxels are considered. For our tests, we normalized the distances between the image's voxels to unity, and chose a maximum edge length l < 2, producing individual connected subcomponents if they can be clustered into distant groups, referred to as LCGs in this paper.
Selecting connected components
Oftentimes, segmented or sparse density data yield multiple LCGs. Also, in general, it is not known which helix endpoints the loop actually lies between. We must then determine the best LCG for each possible pair of helix endpoints. For two helices, one with endpoints p and q and the other with r and s, there exists a set Z of four possible endpoint pairs: Z := {{p, r}, {p, s}, {q, r}, {q, s}}. For each endpoint pair z ∈ Z, let the directed Hausdorff distance to an LCG [28] be defined as
where z is the set of helix endpoints (comprised of voxels denoted z_{ i }) and b is an LCG (comprised of voxels denoted b_{ j }) from the set B of all LCGs; d(z_{ i }, b_{ j }) is then the Euclidean distance between a helix endpoint voxel and LCG voxel. In the presence of multiple LCGs, we choose the best LCG ${\widehat{l}}_{z}$ per endpoint pair z ∈ Z by taking the minimum directed Hausdorff distance over all LCGs:
We can then use the voxels of ${\widehat{l}}_{z}$ to build our model of the loop between the endpoints of z.
It should be noted here that the directed Hausdorff is not commutativein general, h(M, N) ≠ h(N, M) and we always chose M as a set (pair) of helix endpoints, and N as an LCG. Figure 6 shows the configuration for case 30 (PDB 1O6L) from Table 1, where we want to find ${\widehat{l}}_{z}$ among the set of LCGs B := {1, 2, 3, 4, 5, 6} to search for the loop that may lie between the helix endpoint pair a. After finding ${\widehat{l}}_{z}$ using equation (2), we repeat the procedure for each other helix endpoint pair. We try connecting the helix endpoints to their respective closest voxels in ${\widehat{l}}_{z}$ with respect to the Euclidean distance. If either of the new edges connecting p or r is longer than 5Å, we discard the combination as an infeasible path.
Pathfinding
After finding the best LCG for a given possible helix endpoint pair, the next step is constructing a path that traverses it in a way that will approximate the loop. We simply performed a breadthfirst search starting from one of the helix endpoints we added, and reconstruct the path that ends at the other one in the graph [29], with a helix endpoint as the source. For a given HLH, we find four such paths, one for each possible helix endpoint pair.
Path simplification
Ideally, the distance between two specific ends of two helices should be measured along the skeleton connecting the two ends by using our initial path. If we simply add the length of the line segments along the initial path, there is a danger of over estimation due to the potential zigzagging induced from drawing a path along the edges of the cubic lattice of the 3D image.
DouglasPeucker line simplification [30, 31] is the systematic removal of points that lie beyond some distance ∈ from a line describing the general orientation of a piecewise linear curve (polyline) or one of its subsegments. Consider the twodimensional example in Figure 7. Part (i) shows an initial polyline $\overline{a...b}$. The algorithm is recursive, and takes as parameters the tolerance ∈ (Figure 7 (ii)) and a multipoint segment of a polyline. At each recursive iteration it finds an interior point of the current segment which is the most distant from the straight line connecting the end points of the segment, as in Figure 7 (ii) and 7(iii). If all of the current segment's vertices lie within the ∈ band, the segment is replaced with a straight line segment containing only its endpoints. Otherwise, the segment is split at the most distant point and each subsegment is handled recursively. In Figure 7 (iii), $\overline{ac}$ and $\overline{cb}$ are treated in different recursive calls; e is the farthest point from $\overline{cb}$, and no points lie outside the epsilon band for $\overline{ac}$. Overall, the initial polyline $\overline{a...b}$ is simplified into polyline $\overline{aceb}$, which approximates the length of the loop between helix endpoints.
Abbreviations
 CryoEM:

electron cryomicroscopy
 SSE:

secondary structure element  either αhelices or βsheets
 EMDB:

Electron Microscopy Data Bank
 PDB:

Protein Data Bank
 HLH:

helixloophelix motif found in protein structures
 LCG:

local connectivity graph  a connected graph of skeleton voxels with a maximum allowed edge length.
References
 1.
Lawson C, et al.: unified data resource for CryoEM. Nucleic Acids Res 2011, 39(Database):D456–64. [http://EMDatabank.org] 10.1093/nar/gkq880
 2.
Zhang X, Sun S, Xiang Y, Wong J, Klose T, Raoult D, Rossmann M: Structure of Sputnik, a virophage, at 3.5A resolution. Proc Nat Acad Sci USA 2012, 109: 18431–18436. 10.1073/pnas.1211702109
 3.
Zhang X, Ge P, Yu X, Brannan J, Bi G, Zhang Q, Schein S, Zhou Z: CryoEM structure of the mature dengue virus at 3.5A resolution. Nat Struct Mol Biol 2012, 20: 105–110. 10.1038/nsmb.2463
 4.
Lu Y, Strauss C, He J: Incorporation of Constraints from Low Resolution Density Map in Ab Initio Structure Prediction Using Rosetta. Proceeding of 2007 IEEE international Conference on Bioinformatics and Biomedicine Workshops 2007, 67–73.
 5.
Baker M, Jiang W, Wedemeyer W, Rixon F, Baker D, Chiu W: Ab initio modeling of the herpes virus VP26 core domain assessed by CryoEM density. PLoS Comput Biol 2006, 2(10):e146. 10.1371/journal.pcbi.0020146
 6.
Lindert S, Staritzbichler R, Wotzel N, KarakaS M, Stewart P, Meiler J: EMfold: De novo folding of alphahelical proteins guided by intermediateresolution electron microscopy density maps. Structure 2009, 17(7):990–1003. 10.1016/j.str.2009.06.001
 7.
Lindert S, Alexander N, Wötzel N, Karakas M, Stewart PL, Meiler J: EMfold: de novo atomicdetail protein structure determination from mediumresolution density maps. Structure 2012, 20(3):464–478. 10.1016/j.str.2012.01.023
 8.
Nasr KA, Chen L, Si D, Ranjan D, Zubair M, He J: Building the Initial Chain of the Proteins through De Novo Modeling of the CryoElectron Microscopy Volume Data at the Medium Resolutions. ACM Conference on Bioinformatics, Computational Biology and Biomedicine, Orlando, FL 2012.
 9.
Nasr KA, Ranjan D, Zubair M, He J: Ranking valid topologies of the secondary structure elements using a constraint graph. J Bioinform Comput Biol 2011, 9(3):415–30. 10.1142/S0219720011005604
 10.
Baker M, Abeysinghe S, Schuh S, Coleman R, Abrams A, Marsh M, Hryc C, Ruths T, Chiu W, Ju T: Modeling protein structure at near atomic resolutions with Gorgon. Journal of Structural Biology 2011, 174(2):360–373. 10.1016/j.jsb.2011.01.015
 11.
Jiang W, Baker M, Ludtke S, Chiu W: Bridging the information gap: computational tools for intermediate resolution structure interpretation. J Mol Biol 2001, 308(5):1033–44. 10.1006/jmbi.2001.4633
 12.
Palu A, He J, Pontelli E, Lu Y: Identification of AlphaHelices from Low Resolution Protein Density Maps. Proceeding of Computational Systems Bioinformatics Conference (CSB) 2006, 89–98.
 13.
Baker M, Ju T, Chiu W: Identification of secondary structure elements in intermediateresolution density maps. Structure 2007, 15: 7–19. 10.1016/j.str.2006.11.008
 14.
Kong Y, Zhang X, Baker T, Ma J: A Structuralinformatics approach for tracing betasheets: building pseudoC(alpha) traces for betastrands in intermediateresolution density maps. J Mol Biol 2004, 339: 117–30. 10.1016/j.jmb.2004.03.038
 15.
Zeyun Y, Bajaj C: Computational Approaches for Automatic Structural Analysis of Large Biomolecular Complexes. Computational Biology and Bioinformatics, IEEE/ACM Transactions on 2008, 5(4):568–582.
 16.
Si D, Ji S, Nasr K, He J: A machine learning approach for the identification of protein secondary structure elements from electron cryomicroscopy density maps. Biopolymers 2012, 97(9):698–708. 10.1002/bip.22063
 17.
Ju T, Baker ML, Chiu W: Computing a family of skeletons of volumetric models for shape description. Computer Aided Design 2007, 39(5):352–60. 10.1016/j.cad.2007.02.006
 18.
McGuffin L, Bryson K, Jones D: The PSIPRED protein structure prediction server. Bioinformatics 2000, 16(4):404–5. 10.1093/bioinformatics/16.4.404
 19.
Ward J, McGuffin L, Buxton B, Jones D: Secondary structure prediction with support vector machines. Bioinformatics 2003, 19(13):1650–5. 10.1093/bioinformatics/btg223
 20.
Pollastri G, McLysaght A: Porter: a new, accurate server for protein secondary structure prediction. Bioinformatics 2005, 21(8):1719–20. 10.1093/bioinformatics/bti203
 21.
Pollastri G, Przybylski D, Rost B, Baldi P: Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins 2002, 47(2):228–35. 10.1002/prot.10082
 22.
Sun W, He J: Native secondary structure topology has near minimum contact energy among all possible geometrically constrained topologies. Proteins 2009, 77: 159–173. 10.1002/prot.22427
 23.
Biswas A, Si D, Nasr KA, Ranjan D, Zubair M, He J: Improved efficiency in cryoEM secondary structure topology determination from inaccurate data. J Bioinform Comput Biol 2012, 10(3):1242006–11242006–16. 10.1142/S0219720012420061
 24.
Abeysinghe S, Ju T: Shape modeling and matching in identifying protein structure from low resolution images. Proceedings of the 2007 ACM symposium on Solid and physical modeling 2007.
 25.
Mcknight A, Nasr KA, Si D, Chernikov A, Chrisochoides N, He J: CryoEM skeleton length estimation using a decimated curve. Bioinformatics and Biomedicine Workshops (BIBMW), 2012 IEEE International Conference on: 4–7 October 2012 2012, 109–113. 10.1109/BIBMW.2012.6470283
 26.
Ludtke S, Baldwin P, Chiu W: EMAN: semiautomated software for highresolution singleparticle reconstructions. J Struct Biol 1999, 128: 82–97. 10.1006/jsbi.1999.4174
 27.
Whitford D: Proteins: Structure and Function. Wiley; 2005.
 28.
Veltkamp RC: Shape Matching: Similarity Measures and Algorithms. Proceedings of the International Conference on Shape Modeling and Applications SMI '01, Washington, DC, USA: IEEE Computer Society; 2001, 188. [http://dl.acm.org/citation.cfm?id=882486.884078]
 29.
Cormen T, Leierson C, Rivest R, Stein C: Introduction to Algorithms. 3rd edition. MIT Press; 2009.
 30.
Douglas D, Peucker T: Algorithms for the Reduction of the Number of Points Required to Represent a Digitized Line or its Caricature. Cartograhica 1973, 10(2):112–122. 10.3138/FM576770U75U7727
 31.
Hershberger J, Snoeyink J: Speeding up the DouglasPeucker line simplification algorithm. 5th Intl Symp on Spatial Data Handling 1992, 134–143.
Declarations
Research was funded in part by NSF grants: CCF1139864, CCF1136538, and CSI1136536 and by the Richard T. Cheng Endowment, the ODU MSF fund and the ODU startup fund. Nikos Chrisochoides helped defray publication costs via NSF funding.
This article has been published as part of BMC Structural Biology Volume 13 Supplement 1, 2013: Selected articles from the Computational Structural Bioinformatics Workshop 2012. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcstructbiol/supplements/13/S1.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
Andrew McKnight compiled the test set and developed the software for the algorithm. Jing He, as advisor, guided through the problem she has previously researched regarding SSE topology matching. Nikos Chrisochoides and Andrey Chernikov provided technical information and guidance for our algorithm. Dong Si and Kamal Al Nasr provided software and technical support for SSE extraction from and skeletonization of cryoEM density maps.
Rights and permissions
About this article
Published
DOI
Keywords
 Protein Data Bank
 Density Image
 Loop Length
 Initial Path
 Protein Data Bank Structure