Protein structure determination via an efficient geometric build-up algorithm
© Wu et al; licensee BioMed Central Ltd. 2010
Published: 17 May 2010
A protein structure can be determined by solving a so-called distance geometry problem whenever a sufficient set of inter-atomic distances is available. However, the problem is intractable in general and has been proved to be NP-hard. An updated geometric build-up algorithm (UGB) has been developed recently that controls numerical errors and is efficient in protein structure determination for cases where only sparse exact distance data are available. In this paper, the UGB method is improved and revised with the aim of solving distance geometry problems more efficiently and effectively.
An efficient algorithm, called the revised updated geometric build-up algorithm (RUGB), is presented that builds up a protein structure from atomic distance data and provides an effective way of determining a protein structure with sparse exact distance data. In the algorithm, the condition for determining an unpositioned atom iteratively is relaxed (when compared with the UGB algorithm) and data structure techniques are used to make the algorithm more efficient and effective. The algorithm is tested on a set of proteins selected randomly from the Protein Data Bank (PDB).
On the tested proteins, the numerical errors produced by the new RUGB algorithm are smaller than those of the UGB algorithm, and the RUGB algorithm has a significantly smaller runtime than the UGB algorithm.
The RUGB algorithm relaxes the condition for updating and incorporates a data structure for accessing the neighbours of an atom. The revisions result in an improvement over the UGB algorithm in two important areas: a reduction in the overall runtime and a decrease in the numerical error.
Proteins are important biomolecules in biological systems and activities. A protein is a polypeptide chain made of 20 different types of amino acids. The amino acid sequence determines the structure of the protein. Knowledge of the protein structure gives us insight into the function of the protein and its dynamics. Therefore, it is always important to have an accurate protein structure at the highest resolution available. The distances between many pairs of atoms in a protein can often be determined from our knowledge of chemistry (for example, certain types of bond lengths and bond angles), or from nuclear magnetic resonance (NMR) experiments. If a sufficiently large set of inter-atomic distances can be obtained, then a protein structure can be determined by solving a so-called molecular distance geometry problem (MDGP). MDGPs in their most general form are known to be computationally intractable (NP-hard).
In an experimental setting there are two additional restrictions. First, often only a small subset of all pairwise distances is available. Second, instead of a single distance, experiments might only yield a distance range for a pair of atoms (a lower bound and an upper bound on the distance). Several algorithms have been developed as solutions or approximate solutions to the MDGP. These algorithms include singular value decomposition, the embedding algorithm, the alternating projection algorithm, the graph reduction algorithm, the multi-scaling algorithm, and the global optimization algorithm. Many of these algorithms are computationally expensive, in particular if they attempt to solve the MDGP in a general form.
In this paper we only consider the MDGP in the case when exact distances are available. Furthermore, we concentrate on a particular class of algorithms that are computationally quite fast and will often suffice to solve the MDGP. These are the so-called geometric build-up (GB) algorithms. A GB algorithm is based on the idea of iteratively adding one atom at a time to a list of positioned atoms.
Here we refer to a positioned atom as an atom with known coordinates in 3D space and to an unpositioned atom as an atom whose 3D coordinates are not known. It is well known in geometry that in 3D an unpositioned point P can be positioned when there exist four positioned non-coplanar points, each of which has a known distance to P. It is easy to see that when all pairwise distances between atoms are available, a set of four such atoms can always be found. Such a set of four atoms used to determine another atom is also called a metric base. When all distances are known, the algorithm has a linear running time because a metric base is easily found, so in this case the MDGP is clearly not a hard problem. However, such an ideal situation will hardly ever arise. The algorithm needs to be modified to determine a protein structure when only a sparse set of pairwise distances is available. In such a case, finding a metric base to add an atom to the list of positioned atoms requires more work. The simplest idea would be to exhaustively search through all possible metric bases until one is found that allows the positioning of an atom. Theoretically, when a sparse distance data set has sufficiently many entries, a protein structure can be determined.
A major problem in the geometric build-up procedure is numerical stability when a protein has a large number of atoms. Due to computational round-off or truncation, errors are introduced into the built-up coordinates, and the iterative nature of the algorithm can cause these errors to accumulate. This problem has been addressed by an updated geometric build-up (UGB) algorithm. The updating reduces the accumulation of numerical error to a tolerable level, so the UGB algorithm can solve the MDGP with high accuracy. The idea of the updating procedure is to re-compute the coordinates of the four atoms in the metric base, whenever possible, using the original, correct pairwise distances. The fresh coordinates of these four atoms, with minimal numerical error, can then be used to determine the unpositioned atom more accurately. The drawback of the UGB algorithm is the additional computational time required to select a metric base carefully and to carry out the updating procedure itself.
Geometric build-up algorithms
Definition 1.1 A set of points $B$ (with known coordinates) in a space (usually $\mathbb{R}^3$) is a metric basis of a set $S$ of points provided the coordinates of each point of $S$ are uniquely determined by its known distances to the points in $B$.
Definition 1.2 A set of four points in $\mathbb{R}^3$ is called independent if they are not coplanar.
Definition 1.3 A point $u_i$ is called a neighbouring point of a point $u_j$ if $u_i$ has a known distance $d_{i,j}$ from $u_j$.
Theorem 1.1 Given a set of distances among four non-coplanar points, the coordinates of the four points can be uniquely determined up to a rigid motion, that is, a combination of a translation, a rotation and possibly a reflection.
Proof. The six distances between the four points define a tetrahedron, which is unique up to such a motion.
Theorem 1.2 If the coordinates of four non-coplanar atoms $x_i$, $i = 1, 2, 3, 4$, and the distances $d_{i,j}$, $i = 1, 2, 3, 4$, to a fifth atom $x_j$ are given, then the coordinates of the fifth atom $x_j$ can be determined uniquely. In other words, any four independent points in $\mathbb{R}^3$ form a metric basis for $\mathbb{R}^3$.
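Proof. While this theorem is geometrically obvious, we provide a short proof that gives insight into how the coordinates of the fifth atom are actually computed. Let $x_i = (u_i, v_i, w_i)^T$, $i = 1, 2, 3, 4$, be the coordinate vectors of the first four atoms and $x_j = (u_j, v_j, w_j)^T$ the coordinate vector of the fifth atom. We then have a set of equations,
$$\|x_i - x_j\| = d_{i,j}, \quad i = 1, 2, 3, 4.$$
Square the equations and expand their left-hand sides to obtain
$$\|x_i\|^2 - 2 x_i^T x_j + \|x_j\|^2 = d_{i,j}^2, \quad i = 1, 2, 3, 4.$$
Subtract the first equation from the rest to reduce the equations to the following three,
$$-2 (x_{i+1} - x_1)^T x_j = \left(d_{i+1,j}^2 - d_{1,j}^2\right) - \left(\|x_{i+1}\|^2 - \|x_1\|^2\right), \quad i = 1, 2, 3.$$
Define the matrix $A$ and the vector $b$ by:
$$A = -2 \begin{pmatrix} (x_2 - x_1)^T \\ (x_3 - x_1)^T \\ (x_4 - x_1)^T \end{pmatrix}, \qquad b = \begin{pmatrix} (d_{2,j}^2 - d_{1,j}^2) - (\|x_2\|^2 - \|x_1\|^2) \\ (d_{3,j}^2 - d_{1,j}^2) - (\|x_3\|^2 - \|x_1\|^2) \\ (d_{4,j}^2 - d_{1,j}^2) - (\|x_4\|^2 - \|x_1\|^2) \end{pmatrix}.$$
We can then write the above equations in the following matrix form.
$$A x_j = b.$$
Since $x_1, x_2, x_3, x_4$ are not in the same plane, the matrix $A$ is nonsingular and therefore the linear system of equations can be solved to obtain a unique solution for $x_j$. Therefore, any four independent points in $\mathbb{R}^3$ form a metric basis for $\mathbb{R}^3$.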
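The computation in the proof of Theorem 1.2 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `position_from_four` and the toy coordinates are assumptions made for the example, and the system is the one obtained by subtracting the first squared-distance equation from the other three.

```python
import numpy as np

def position_from_four(base, dists):
    """Locate a fifth point from four non-coplanar points and its distances to them.

    base  : (4, 3) array, coordinates x_1..x_4 of the metric base (hypothetical input layout)
    dists : (4,) array, distances d_{1,j}..d_{4,j} to the unknown point x_j
    """
    base = np.asarray(base, dtype=float)
    dists = np.asarray(dists, dtype=float)
    # Rows of A are -2 (x_{i+1} - x_1)^T for i = 1, 2, 3.
    A = -2.0 * (base[1:] - base[0])
    # b_i = (d_{i+1,j}^2 - d_{1,j}^2) - (||x_{i+1}||^2 - ||x_1||^2).
    sq_norms = np.sum(base ** 2, axis=1)
    b = (dists[1:] ** 2 - dists[0] ** 2) - (sq_norms[1:] - sq_norms[0])
    # A is nonsingular exactly when the four base points are not coplanar.
    return np.linalg.solve(A, b)

# Toy check: hide a point, keep only its four distances, and recover it.
base = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
target = np.array([0.3, -1.2, 2.5])
dists = np.linalg.norm(base - target, axis=1)
recovered = position_from_four(base, dists)
```

Solving a fixed 3-by-3 system costs a constant number of operations, which is why each atom can be added in constant time once a metric base is known.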
Note that the computation given in the proof of Theorem 1.2 shows that the coordinates of $x_j$ can be computed in constant time. Therefore, the geometric build-up algorithm can determine a protein structure in linear running time when all exact distances are available. Moreover, when all distances are available a single metric base can be used throughout the process, because in each iteration the four required distances will be available. However, there is no guarantee that a solution to the MDGP can be found when only sparse distance data are available. If we assume that any initial four atoms will lead to a protein structure using the GB algorithm, the algorithm will require $O(n^3)$ running time in a worst-case analysis. There are three nested loops in the GB algorithm: a while-loop (while $L$ is not empty, where $L$ is the list of unpositioned atoms); within the while-loop, a for-loop (check all remaining atoms in $L$ to find one that can be determined from the currently determined atoms); and within the for-loop, a search for four determined atoms with known distances to the given atom. Each loop takes $O(n)$ steps in the worst case. Therefore, the worst-case total running time is $O(n^3)$.
As shown in previous reports, sparse distance data can produce a large numerical rounding error that must be dealt with. With sparse distance data, a different metric base must almost always be used for the determination of each atom. Thus, the metric bases used to determine unpositioned atoms contain rounding errors from earlier calculations, and the errors introduced in previous steps accumulate. As a result, the matrix $A$ in the proof of Theorem 1.2 is often not accurate and hence cannot be used to determine new coordinates of atoms accurately. In summary, the GB algorithm produces larger and larger rounding errors in the coordinate determination of unpositioned atoms.
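The three nested loops described above can be illustrated with the following sketch of a basic GB pass without updating. It is written under stated assumptions: distances are stored in a dictionary keyed by atom pairs `(i, j)` with `i < j`, the four starting atoms already have coordinates, and the helper names (`solve_fifth`, `pick_base`, `geometric_build_up`) as well as the helix test data are hypothetical.

```python
import numpy as np

def solve_fifth(base, dists):
    """Solve the 3x3 linear system from the proof of Theorem 1.2."""
    A = -2.0 * (base[1:] - base[0])
    sq = np.sum(base ** 2, axis=1)
    b = (dists[1:] ** 2 - dists[0] ** 2) - (sq[1:] - sq[0])
    return np.linalg.solve(A, b)

def pick_base(candidates, coords, tol=1e-8):
    """Greedily pick four non-coplanar atoms from the candidates, or return None."""
    base = []
    for p in candidates:
        trial = base + [p]
        if len(trial) > 1:
            diffs = np.array([coords[q] - coords[trial[0]] for q in trial[1:]])
            if np.linalg.matrix_rank(diffs, tol=tol) < len(trial) - 1:
                continue  # p adds no new direction, so skip it
        base = trial
        if len(base) == 4:
            return base
    return None

def geometric_build_up(n, dist, start):
    """Return (coords, still_unpositioned) after a basic build-up pass."""
    coords = {a: np.asarray(c, dtype=float) for a, c in start.items()}
    unpositioned = [a for a in range(n) if a not in coords]
    progress = True
    while unpositioned and progress:            # loop 1: until L is empty (or stuck)
        progress = False
        for a in list(unpositioned):            # loop 2: scan the remaining atoms
            known = [p for p in coords          # loop 3: positioned neighbours of a
                     if (min(a, p), max(a, p)) in dist]
            base = pick_base(known, coords)
            if base is None:
                continue
            pts = np.array([coords[p] for p in base])
            d = np.array([dist[(min(a, p), max(a, p))] for p in base])
            coords[a] = solve_fifth(pts, d)
            unpositioned.remove(a)
            progress = True
    return coords, unpositioned

# Toy data: ten points on a helix, keeping only distances below a cut-off of 3.0.
pts = np.array([[np.cos(t), np.sin(t), 0.5 * t] for t in range(10)])
dist = {(i, j): float(np.linalg.norm(pts[i] - pts[j]))
        for i in range(10) for j in range(i + 1, 10)
        if np.linalg.norm(pts[i] - pts[j]) < 3.0}
coords, left = geometric_build_up(10, dist, {i: pts[i] for i in range(4)})
```

The extra `progress` flag is a design choice of this sketch: it stops the outer loop when no remaining atom has a usable metric base, which can happen with sparse data.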
An updated geometric build-up algorithm (UGB)
This algorithm incorporates the idea of re-computing the coordinates of the four atoms in a metric base to minimize the rounding error. In many cases there are several ways to select a metric base of four atoms that can determine an unpositioned atom. In the updated geometric build-up algorithm, four non-coplanar atoms with original distances among them are preferred. The reason is that such a metric base forms a tetrahedron $T$ consisting of original distances, which allows the atoms of the metric base to be positioned relative to each other with minimal rounding error. The coordinates of the unpositioned atom can then be determined with minimal rounding error relative to the tetrahedron $T$, creating a complex of five atoms whose edges form a complete graph $K_5$.
There are two major steps in this algorithm. First, the positions of the four base atoms are recomputed based on Theorem 1.1. The new positions of the four base atoms are completely independent of their old positions, and this first step guarantees that the four base atoms form a tetrahedron in which the distances between the atoms are as accurate as possible. Second, the translation vector and rotation matrix need to be found for re-initializing. This second step requires techniques used in the computation of the root mean square deviation (RMSD).
We explain the re-initialization step for a tetrahedron when all distances among four atoms are available. Let $(x_i, y_i, z_i)$ be the coordinates of the $i$th atom, $i = 1, 2, 3, 4$, and let $d_{ij}$ be the distance between the $i$th and $j$th atoms, $i, j = 1, 2, 3, 4$. The initialization consists of the following steps. We put the first atom at the origin, the second atom on the x-axis and the third atom into the xy-plane. Then we can determine the position of the fourth atom. The formulas below explain the above steps; a more detailed explanation of the procedure is available in the reference.
$x_1 = 0, \quad y_1 = 0, \quad z_1 = 0$
$x_2 = d_{21}, \quad y_2 = 0, \quad z_2 = 0$
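The coordinates of the third and fourth atoms follow from the remaining distance constraints. The sketch below derives them directly from those constraints rather than quoting the original formulas, so the algebraic form may differ from the one in the reference; the function name `init_tetrahedron` is hypothetical. The sign chosen for $z_4$ fixes one of the two mirror images, which is consistent with Theorem 1.1.

```python
import numpy as np

def init_tetrahedron(d21, d31, d32, d41, d42, d43):
    """Place four atoms from their six pairwise distances.

    Atom 1 sits at the origin, atom 2 on the x-axis, atom 3 in the xy-plane.
    """
    x2 = d21
    # Atom 3: |p3| = d31 and |p3 - p2| = d32 give x3, then y3 >= 0.
    x3 = (d31 ** 2 - d32 ** 2 + d21 ** 2) / (2.0 * d21)
    y3 = np.sqrt(max(d31 ** 2 - x3 ** 2, 0.0))
    # Atom 4: |p4| = d41 and |p4 - p2| = d42 give x4.
    x4 = (d41 ** 2 - d42 ** 2 + d21 ** 2) / (2.0 * d21)
    # |p4| = d41 and |p4 - p3| = d43 give y4 (uses x3^2 + y3^2 = d31^2).
    y4 = (d41 ** 2 - d43 ** 2 + d31 ** 2 - 2.0 * x3 * x4) / (2.0 * y3)
    # z4 >= 0 is an arbitrary choice between the two mirror images.
    z4 = np.sqrt(max(d41 ** 2 - x4 ** 2 - y4 ** 2, 0.0))
    return np.array([[0.0, 0.0, 0.0],
                     [x2, 0.0, 0.0],
                     [x3, y3, 0.0],
                     [x4, y4, z4]])

# Toy check with distances taken from a known tetrahedron.
ref = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [1.0, 3.0, 0.0], [1.0, 1.0, 2.0]])
d = lambda i, j: float(np.linalg.norm(ref[i] - ref[j]))
tetra = init_tetrahedron(d(1, 0), d(2, 0), d(2, 1), d(3, 0), d(3, 1), d(3, 2))
```

The `max(..., 0.0)` guards only protect the square roots against tiny negative values caused by rounding; they do not repair inconsistent distance data.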
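We explain the standard RMSD steps for any two structures of embedded points with coordinate matrices $X$ and $Y$ of an identical set of $n$ points. In our case $n = 4$, the matrix $X$ contains the old coordinates of the metric base atoms and the matrix $Y$ contains the recomputed coordinates of the metric base atoms. First, we need to translate these two structures so that their geometric centers are both at the origin. This can be done using the following formulas,
$$x_c(j) = \frac{1}{n} \sum_{i=1}^{n} X(i,j), \qquad y_c(j) = \frac{1}{n} \sum_{i=1}^{n} Y(i,j), \qquad j = 1, 2, 3,$$
$$X_1(i,j) = X(i,j) - x_c(j), \qquad Y_1(i,j) = Y(i,j) - y_c(j), \qquad i = 1, \dots, n, \; j = 1, 2, 3.$$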
Now, $X_1$ and $Y_1$ are the two translated matrices with the same geometric center at the origin. We can then find the rotation matrix $Q$ so that the RMSD value of $X_1$ and $Y_1$ is minimized. This is formulated as $\mathrm{RMSD}(X_1, Y_1) = \min_Q \|X_1 - Y_1 Q\|_F / \sqrt{n}$, where $Q$ is a rotation matrix and $\|\cdot\|_F$ is defined by $\|X - Y\|_F = \left(\sum_{i=1}^{n} \|X_i - Y_i\|^2\right)^{1/2}$, where $\|X_i - Y_i\|$ is the distance between the two points $X_i$ and $Y_i$. $Q$ can be computed through the following steps. Compute $C = Y_1^T X_1$; then let $U \Sigma V^T = C$ be the singular value decomposition of $C$. It can be easily verified that $Q = U V^T$ is the solution to the above minimization problem. In the updated geometric build-up algorithm, the above computations give the translation vectors $(x_c(1), x_c(2), x_c(3))$ and $(y_c(1), y_c(2), y_c(3))$ and the rotation matrix $Q$. Applying these to the recomputed coordinates of the four metric base atoms and the newly determined atom, the five atoms can be translated and rotated back into the protein structure.
Compared to the general geometric build-up algorithm, in many cases only the updated geometric build-up algorithm can determine protein structures completely and accurately when a sparse set of distance data is available. However, the algorithm has the drawback that a brute-force search for a metric base of four atoms with known distances among them can take up to $O(n^4)$ steps (if one considers all 4-element subsets of $n$ points), and then the total running time can be $O(n^6)$. The majority of this worst-case running time is spent finding four atoms with all distances among them.
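The superposition step can be sketched as follows, assuming the points are stored as rows of $n \times 3$ arrays so that $C = Y_1^T X_1$ is a 3-by-3 matrix. The function names `superpose` and `map_back` are hypothetical. As in the text, $Q = U V^T$ is used without a determinant correction, so $Q$ is orthogonal and may include a reflection; that matches the freedom left by Theorem 1.1 when the base atoms are recomputed from distances alone.

```python
import numpy as np

def superpose(X, Y):
    """Find centroids xc, yc and orthogonal Q so that (Y - yc) @ Q + xc best matches X."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    xc = X.mean(axis=0)                  # geometric center of the old coordinates
    yc = Y.mean(axis=0)                  # geometric center of the recomputed coordinates
    X1 = X - xc
    Y1 = Y - yc
    C = Y1.T @ X1                        # 3x3 matrix C = Y1^T X1
    U, _, Vt = np.linalg.svd(C)          # C = U diag(S) Vt
    Q = U @ Vt                           # minimizer of ||X1 - Y1 Q||_F over orthogonal Q
    rmsd = np.linalg.norm(X1 - Y1 @ Q) / np.sqrt(len(X))
    return xc, yc, Q, rmsd

def map_back(points, xc, yc, Q):
    """Move recomputed points (rows) into the frame of the existing structure."""
    return (np.asarray(points, dtype=float) - yc) @ Q + xc

# Toy check: Y_new is a canonical tetrahedron, X_old the same shape moved rigidly.
Y_new = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [1.0, 3.0, 0.0], [1.0, 1.0, 2.0]])
R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])   # 90 degree turn about z
X_old = Y_new @ R + np.array([5.0, -1.0, 2.0])
xc, yc, Q, rmsd = superpose(X_old, Y_new)
restored = map_back(Y_new, xc, yc, Q)
```

In the build-up setting, `map_back` would be applied to the four recomputed base atoms together with the newly determined fifth atom.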
In this paper, the UGB algorithm is improved by a revised updated geometric build-up algorithm (RUGB). This algorithm aims at reducing the computational complexity of the UGB algorithm. As we will show, the RUGB algorithm also produces a smaller numerical error than the UGB algorithm.
A revised updated geometric build-up algorithm (RUGB)
Although the updated geometric build-up algorithm (UGB) has been shown to control numerical errors, it requires a search in every iteration for four atoms with distances among them to serve as a metric base. A revised updated geometric build-up algorithm is described in this paper. The algorithm is based on the regular updated geometric build-up algorithm and is modified by adding a new data structure and relaxing the condition on a metric base. The first modification is that, instead of requiring four metric base atoms with distances among them, the algorithm requires three metric base atoms with distances among them and one additional atom. The purpose of relaxing the condition is to cut down the time it takes to find a new metric base. The updating scheme can still be implemented with only three metric base atoms. However, using three atoms with all distances among them results in two possible sets of coordinates for the position of an undetermined atom. In order to distinguish the correct solution from the incorrect one, we use the distance to a fourth determined atom that is not coplanar with the first three base atoms. This strategy is also based on Theorem 1.2. The re-initialization and updating of the metric base of three atoms follow steps similar to those of the UGB algorithm introduced in the previous section, with three atoms rather than four being considered.
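The relaxed condition can be illustrated with a short sketch: three base atoms give two mirror-image candidates for the unknown atom, and the distance to a fourth positioned atom selects one of them. The formulas are the standard trilateration construction, used here as an assumption since the text does not spell them out, and the function names are hypothetical.

```python
import numpy as np

def two_candidates(p1, p2, p3, r1, r2, r3):
    """Return the two points at distances r1, r2, r3 from p1, p2, p3."""
    p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p1, p2, p3))
    # Local frame: ex along p1->p2, ey in the plane of the triangle, ez normal to it.
    d = np.linalg.norm(p2 - p1)
    ex = (p2 - p1) / d
    i = ex @ (p3 - p1)
    ey = p3 - p1 - i * ex
    ey = ey / np.linalg.norm(ey)
    ez = np.cross(ex, ey)
    j = ey @ (p3 - p1)
    # Coordinates of the unknown atom in the local frame.
    x = (r1 ** 2 - r2 ** 2 + d ** 2) / (2.0 * d)
    y = (r1 ** 2 - r3 ** 2 + i ** 2 + j ** 2 - 2.0 * i * x) / (2.0 * j)
    z = np.sqrt(max(r1 ** 2 - x ** 2 - y ** 2, 0.0))
    foot = p1 + x * ex + y * ey
    return foot + z * ez, foot - z * ez   # mirror images across the triangle plane

def position_with_three_plus_one(p1, p2, p3, p4, r1, r2, r3, r4):
    """Pick the candidate whose distance to the fourth atom p4 best matches r4."""
    p4 = np.asarray(p4, dtype=float)
    candidates = two_candidates(p1, p2, p3, r1, r2, r3)
    errors = [abs(np.linalg.norm(c - p4) - r4) for c in candidates]
    return candidates[int(np.argmin(errors))]

# Toy check: the unknown atom lies below the plane of the three base atoms.
p1, p2, p3, p4 = [0, 0, 0], [2, 0, 0], [1, 3, 0], [1, 1, 2]
target = np.array([0.5, 1.0, -1.5])
r = [float(np.linalg.norm(target - np.array(p, dtype=float))) for p in (p1, p2, p3, p4)]
found = position_with_three_plus_one(p1, p2, p3, p4, *r)
```

If the fourth atom were coplanar with the three base atoms, both candidates would have the same distance to it and the choice would be ambiguous, which is why the non-coplanarity condition is needed.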
Table 1. The size of $d_{\max}$ compared to the total number of atoms $n$.
Theorem 3.1 Assume that any four initial metric base atoms can lead to the complete determination of a protein structure given a sparse set of distance data. Then a protein structure can be determined by the revised updated geometric build-up algorithm (RUGB) using $O(n^2 d_{\max}^3)$ steps, where $n$ is the number of atoms and $d_{\max}$ is the largest degree of an atom, that is, the largest number of neighbouring atoms of any atom.
Proof. For any unpositioned atom $A$, it takes $O(d_{\max}^3)$ steps to determine whether there exist three neighbouring atoms $x_1$, $x_2$, $x_3$ that have known distances between them. If this is the case, then it takes $O(d_{\max})$ steps to determine whether there is an additional neighbouring atom $x_4$ of $A$ such that $x_1$, $x_2$, $x_3$ and $x_4$ are non-coplanar.
If both a metric base of three atoms $x_1$, $x_2$, $x_3$ and an additional neighbouring atom $x_4$ can be found, then the updating strategy is applied. It consists of re-computing the coordinates of the metric base atoms $x_1$, $x_2$, $x_3$, determining the coordinates of the unpositioned atom $A$, updating the coordinates by translation and rotation, and using the additional atom $x_4$ to determine the correct position for $A$. For any choice of the four atoms $x_1$, $x_2$, $x_3$ and $x_4$ this can be done in constant time.
Thus, for an unpositioned atom $A$ the total running time is at most $O(d_{\max}^3)$, regardless of whether the position of $A$ can be determined at this point. There are at most $n$ unpositioned atoms, and in the worst case we have to look at all of them before we can add a single atom. Thus it may take $O(n d_{\max}^3)$ steps to add a single atom. Since the list $L$ initially contains $n - 4$ atoms, the total running time is $O(n^2 d_{\max}^3)$.
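The search analysed in the proof can be sketched with a neighbour list, so that only the neighbours of an atom are inspected rather than all $n$ atoms. This is an illustrative sketch under assumptions: distances are keyed by atom pairs, the names `build_neighbours` and `find_triangle_and_fourth` are hypothetical, and the geometric non-coplanarity test on the four atoms is left out to keep the combinatorial part visible.

```python
from collections import defaultdict
from itertools import combinations

def build_neighbours(dist):
    """Map each atom to the set of atoms it has a known distance to."""
    nbrs = defaultdict(set)
    for i, j in dist:
        nbrs[i].add(j)
        nbrs[j].add(i)
    return nbrs

def find_triangle_and_fourth(a, nbrs, positioned):
    """Find three mutually connected positioned neighbours of a, plus a fourth one.

    Inspects at most O(d_max^3) triples and O(d_max) choices for the fourth atom.
    Returns ((x1, x2, x3), x4) or None. A full implementation would also check
    that x4 is not coplanar with x1, x2, x3.
    """
    candidates = sorted(nbrs[a] & positioned)
    for x1, x2, x3 in combinations(candidates, 3):
        if x2 in nbrs[x1] and x3 in nbrs[x1] and x3 in nbrs[x2]:
            for x4 in candidates:
                if x4 not in (x1, x2, x3):
                    return (x1, x2, x3), x4
    return None

# Toy graph: atoms 0, 1, 2 form a triangle; atom 4 knows its distance to 0, 1, 2, 3.
dist = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
        (0, 4): 1.5, (1, 4): 1.5, (2, 4): 1.5, (3, 4): 2.0}
nbrs = build_neighbours(dist)
d_max = max(len(s) for s in nbrs.values())
result = find_triangle_and_fourth(4, nbrs, positioned={0, 1, 2, 3})
```

With a distance cut-off, $d_{\max}$ is bounded by how many atoms fit within the cut-off radius, so the per-atom cost stays small even when $n$ is large.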
Note that in NMR structure determination often only distances of less than 5 Å can be obtained. Therefore, the typical distance matrix is sparse in realistic applications. However, the RUGB algorithm of Figure 3 relies on the successful selection of the initial four metric base atoms. There is no guarantee that an arbitrarily selected metric base for initialization will result in the algorithm completely determining a protein structure. In such a case we can start over by selecting a different set of atoms for initialization. The following theorem gives an upper bound on the computational complexity of the revised updated geometric build-up algorithm, regardless of whether a protein structure can be determined (that is, whether the graph can be realized).
Theorem 3.2 Given a sparse set of distance data for a protein, it takes at most $O(n^3 d_{\max}^6)$ steps to determine whether a protein structure can be solved using the revised updated geometric build-up algorithm.
Proof. In a protein structure, there are at most $O(n d_{\max}^3)$ sets of four atoms that are non-coplanar and have distances among them. Any of these sets of four atoms can be considered as an initial metric base. In the worst case, all of them fail until the last one works, or none of them works at all. Therefore, the upper bound on the running time is $O(n d_{\max}^3) \cdot O(n^2 d_{\max}^3) = O(n^3 d_{\max}^6)$.
We tested the RUGB algorithm on a set of proteins and compared the results with those generated by the GB algorithm and the UGB algorithm. The testing data were prepared in the following way. A set of proteins with their structures was downloaded from the Protein Data Bank (PDB). For each protein, the structure file contains the (x, y, z) coordinates of each atom in the structure, from which a distance matrix of all pairwise distances can be generated. In practice, especially in NMR experiments, only distances of less than 5 Å between two protons are typically available. In our testing we used a cut-off distance of 5 Å and deleted all distances that were larger (if there were any). This resulted in sparse distance data that only contain distances of less than 5 Å. However, due to the poor performance of the general GB algorithm on sparse distance data, we also generated a second matrix using a different cut-off distance of 8 Å. For each test protein, we applied the GB, the UGB and the RUGB algorithms. We analyzed the results by comparing the numerical error and running time of the three algorithms.
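The preparation of the sparse test data can be sketched as follows. The sketch assumes the atomic coordinates have already been read from a structure file into an array (parsing the file is not shown), and the function name `sparse_distances` is hypothetical. Distances strictly below the cut-off are kept, matching the description above.

```python
import numpy as np

def sparse_distances(coords, cutoff):
    """Return {(i, j): distance} for all atom pairs i < j closer than the cut-off."""
    coords = np.asarray(coords, dtype=float)
    # Full pairwise distance matrix via broadcasting.
    diff = coords[:, None, :] - coords[None, :, :]
    full = np.sqrt((diff ** 2).sum(axis=-1))
    # Keep each unordered pair once (upper triangle) if it is below the cut-off.
    i, j = np.triu_indices(len(coords), k=1)
    keep = full[i, j] < cutoff
    return {(int(a), int(b)): float(full[a, b]) for a, b in zip(i[keep], j[keep])}

# Toy example in angstroms: three atoms on a line, cut-offs of 5 and 8.
atoms = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [7.0, 0.0, 0.0]]
sparse5 = sparse_distances(atoms, cutoff=5.0)
sparse8 = sparse_distances(atoms, cutoff=8.0)
```

The full matrix needs memory quadratic in the number of atoms; for large structures a spatial grid or k-d tree would avoid that, but this goes beyond what the text describes.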
Table 2. The numerical results of RUGB and UGB. Columns: RUGB time (s), UGB time (s), RUGB error (Å), UGB error (Å).
The RUGB algorithm produces smaller numerical errors than the UGB algorithm. This is surprising, since the updating schemes are very similar. The main reason could be the following: the RUGB algorithm uses only three base atoms to numerically determine an unpositioned atom, which yields two solutions, and one additional atom to select the correct solution. This updating procedure involves less numerical calculation than the four-atom updating routine of the UGB algorithm, so it could be that the RUGB updating produces a smaller numerical error.
In Table 2, the structure determination of 1KVX shows unusually large numerical errors compared with several other selected proteins that have a similar number of amino acids, such as 1CEU and 1VMP. One reason might be that a triangle selected in the RUGB algorithm leads to a very flat tetrahedron. In this case the four atoms are almost coplanar, and the determination of the position of the unknown atom produces coordinates with a larger error than the error produced by a tetrahedron whose four points are not almost coplanar.
Table 3. Numerical results of using RUGB and GB methods in protein structure determination.
Table 3 shows that the updating procedure plays a very important role in controlling numerical errors; similar results have been reported previously. Using an 8 Å cut-off distance, the GB algorithm can formally determine the structures of all tested proteins; however, the rounding errors are so large that these structures are no longer useful.
Using a 5 Å cut-off distance, the GB algorithm fails to produce a complete protein structure in some instances because the round-off error gets out of control. For the 8 Å cut-off distance the given set of pairwise distances is much denser. This work verifies the importance of the updating that is used in both the RUGB and the UGB algorithms. Both algorithms can indeed determine a protein structure with high accuracy.
A very accurate protein structure is essential for understanding the function and dynamics of the protein in biological systems and activities. Applications of distance geometry in protein structure determination arise from the fact that pairwise distances between atoms in a protein can often be obtained from experiments or from our knowledge of chemistry. Hence a protein structure can be determined if there exists a solution to the distance geometry problem. However, the problem has been proved to be NP-hard. GB algorithms do not solve all distance geometry problems. In the cases where they do give a solution, GB algorithms can determine a protein structure efficiently and accurately. In the GB algorithm, the positions of atoms are determined iteratively and rely on the already determined positions of other atoms, which causes the accumulation of numerical errors. The strategy of updating allows us to control the size of the numerical errors. However, in the UGB algorithm updating requires an expensive step that contributes up to $O(n^4)$ to the running time, and the condition that the four base atoms to be updated must have all their distances known is quite strong. In this paper, the RUGB algorithm relaxes the condition for updating and incorporates a data structure for accessing the neighbours of an atom. This results in an improvement in both the overall runtime and the numerical error over the UGB algorithm.
The RUGB algorithm has been shown to control numerical errors and to be effective. However, this paper provides only theoretical studies of the method. Practical problems, such as NMR structure determination and protein structure prediction, generally involve distance ranges in the data set. In the future, we will address the application of RUGB methods in these cases. The theoretical results also provide only an upper bound on the runtime when a sparse set of distances is given. More advanced methods should also be considered: applications of knowledge from graph theory or other advanced data structures may improve the algorithm further and will be a topic of future research.
The authors would like to thank the National Institutes of Health (NIH) and National Center for Research Resources (NCRR) Grant P20 RR16481 (Kentucky Biomedical Research Infrastructure Network) and National Science Foundation (NSF) Kentucky EPSCoR Research Enhancement Grant (REG) for support.
This article has been published as part of BMC Structural Biology Volume 10 Supplement 1, 2010: Selected articles from the Computational Structural Bioinformatics Workshop 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1472-6807/10?issue=S1.
- Creighton TE: Proteins: Structures and Molecular Properties. 2nd edition. Freeman Company; 1993.
- Wuthrich K: NMR of Proteins and Nucleic Acids. Wiley; 1986.
- Crippen GM, Havel TF: Distance Geometry and Molecular Conformation. John Wiley & Sons; 1988.
- Saxe JB: Embeddability of Weighted Graphs in k-space is Strongly NP-hard. In Proceedings of the 17th Allerton Conf. on Communication, Control and Computing: Oct. 1979; University of Illinois. Edited by: Jose B. Cruz, Jr. University of Illinois; 1979:480–489.
- Glunt W, Hayden TL, Raydan M: Molecular Conformations from Distance Matrices. J Comput Chem 1993, 14: 114–120. 10.1002/jcc.540140115
- Hendrickson BA: The Molecular Problem: Determining Conformation from Pairwise Distances. Ph.D. thesis. Cornell University, Computer Science Department; 1991.
- Torgerson WS: Theory and Applications of Distance Geometry. Oxford Clarendon Press; 1953.
- More J, Wu Z: Global Continuation for Distance Geometry Problems. SIAM Journal of Optim 1997, 3: 814–836. 10.1137/S1052623495283024
- More J, Wu Z: Distance Geometry Optimization for Protein Structures. J of Global Optim 1999, 15: 219–234. 10.1023/A:1008380219900
- Dong Q, Wu Z: A linear-time algorithm for solving the molecular distance geometry problem with exact inter-atomic distances. J Global Optim 2002, 22: 365–375. 10.1023/A:1013857218127
- Wu D, Wu Z: An Updated Geometric Build-up Algorithm for solving the Molecular Distance Geometry Problem with Sparse Distance Data. J Global Optim 2007, 37: 661–673. 10.1007/s10898-006-9080-6
- Dong Q, Wu Z: A geometric build-up algorithm for solving the molecular distance geometry problem with sparse distance data. J Global Optim 2003, 26: 321–333. 10.1023/A:1023221624213
- Wu D, Wu Z, Yuan Y: The Solution of the Distance Geometry Problem in Protein Modeling via Geometric Build-Up. Biophy Rev and Lett 2008, 3: 43–75. 10.1142/S1793048008000617
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The protein data bank. Nucleic Acids Res 2000, 28: 235–242. 10.1093/nar/28.1.235
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.