- Research article
- Open Access
A comprehensive analysis of 40 blind protein structure predictions
BMC Structural Biology volume 2, Article number: 3 (2002)
We thoroughly analyse the results of 40 blind predictions for which an experimental answer was made available at the fourth meeting on the critical assessment of protein structure methods (CASP4). Using our comparative modelling and fold recognition methodologies, we made 29 predictions for targets that had sequence identities ranging from 50% to 10% to the nearest related protein with known structure. Using our ab initio methodologies, we made eleven predictions for targets that had no detectable sequence relationships.
For 23 of these proteins, we produced models ranging from 1.0 to 6.0 Å root mean square deviation (RMSD) for the Cα atoms between the model and the corresponding experimental structure for all or large parts of the protein, with model accuracies scaling fairly linearly with respect to sequence identity (i.e., the higher the sequence identity, the better the prediction). We produced nine models with accuracies ranging from 4.0 to 6.0 Å Cα RMSD for 60–100 residue proteins (or large fragments of a protein), with a prediction accuracy of 4.0 Å Cα RMSD for residues 1–80 for T110/rbfa.
The areas of protein structure prediction that work well, and areas that need improvement, are discernable by examining how our methods have performed over the past four CASP experiments. These results have implications for modelling the structure of all tractable proteins encoded by the genome of an organism.
The state of blind protein structure prediction
The community-wide experiment on methods to test protein structure prediction (CASP) was first initiated in 1994, as a means of evaluating structure prediction methods in a blind and rigourous manner . This was motivated in part by claims in the literature of the protein folding problem being "solved" without producing tangible benefits, since most of the "solutions" included a strong dependence on the test set. These experiments evaluate prediction techniques by asking modellers to construct models for a number of protein sequences before the experimental result is known, over a period of 3–4 months. We have taken part in all four CASP experiments, including the most recent one (CASP4) that finished in December 2000 http://predictioncenter.llnl.gov. The CASP4 results provide a benchmark as to what level of model accuracy we can currently expect from our approaches.
There are three primary categories of methods for predicting protein structure from sequence: comparative modelling, fold recognition, and ab initio prediction. In the comparative modelling and fold recognition categories, the methodologies rely on the presence of one or more evolutionarily related template protein structures that are used to construct a model; the evolutionary relationship can be deduced from sequence similarity [2–5] or by "threading" a sequence against a library of structures and selecting the best match [6–8]. For both of these approaches, a sequence alignment between the target protein to be modelled and the evolutionarily related protein with known structure is used to create the initial or seed model. In the ab initio category, there is no strong dependence on database information and prediction methods are based on general principles that govern protein structure and energetics [9–13]. The categories vary in difficulty, and consequently methods in each of these categories produce models with different levels of accuracy relative to the experimental structures.
Since the inception of CASP, predictors all over the world have built models for 128 proteins using the methodologies described above. Before the first CASP experiment, published results in comparative modelling in the literature usually were obtained by applying structure prediction methods in the context of the exact experimental structure; for example, re-building side chains on the native main chain, or re-building regions of main chain keeping the rest of the experimental structure fixed. (This practice continues to this day.) Ab initio methodologies, parameterised extensively on small test sets, failed when given novel types of sequences.
CASP1 was an eye-opener in terms of understanding the difficulty of making accurate predictions on approximate templates in comparative modelling . The main problems in creating a good comparative or fold recognition model were related to alignment between the template and target sequences, and building of non-conserved variable regions (side chains, and main chain loops). In the ab initio category, it appeared that methods could not sample conformational space accurately, and select native-like conformations, for all but very small fragments. The highlight of CASP1 was the recognition of threading as viable method for predicting folds , and the success of neural-network based secondary structure prediction methods .
The second CASP showed some improvement in two areas: In comparative modelling, loops were being built better, and the use of hand-inspected alignments greatly increased model accuracy . In the fold recognition category, alignments as well as prediction of folds improved . The results in the ab initio category remained virtually unchanged except for one model of the alpha-helical protein NK-Lysin predicted to within 6.0 Å Cα RMSD, capturing the correct topology .
The third CASP saw consistent but little progress relative to CASP2 in both comparative modelling and fold recognition categories [20, 21], but the most dramatic results were observed in the ab initio category. Here multiple groups predicted large parts of a few protein sequences at a crude topological level (within 6.0 Å Cα RMSD for ≈ 60 residues) .
Performance of our methods at CASP1-3: Comparative modelling
Even though our methods produced mediocre results at CASP1, we realised that a major problem with accurate comparative modelling had to do with the interconnected nature of protein structures : If a certain region of the protein varied with respect to the homologue, then it was likely that a structurally interacting region would also vary, even if that region was conserved in sequence. We therefore developed a graph-theory based approach to address this problem which demonstrated significant progress in loop building at CASP2 (Table 1) . The CASP3 and CASP4 results are minor improvements over the CASP2 results since the enhancements made to the graph theory method have been minimal.
Performance of methods at CASP1-3: Ab initio prediction
In the ab initio category, as with comparative modelling, the first CASP experiments did not live up to the results previously published in the literature [16, 19]. It was not until CASP3 (the first time where we took part in this category) that the first consistent positive results were seen: several groups were able to predict the correct topologies for small proteins, or large fragments of a protein (~60–80 residues to about 6.0 Å Cα RMSD relative to the experimental conformation) [22, 25].
The fourth CASP was held in December 2000. CASP4 demonstrated further improvement in our methodologies in both comparative modelling and ab initio prediction . Our approaches combine a Monte Carlo procedure, simulated annealing, genetic algorithms, graph theory, and semi-exhaustive searches with move sets consisting of fragments and discrete state models, and scoring functions consisting of all atom-based pairwise functions, hydrophobicity indices, secondary structure preferences, and hydrogen bonding. The goal was to develop components that form a structure prediction engine, combining and innovating upon previously developed approaches by observing what methods work well at the previous CASP experiments, and adding new components of our own.
We focus here on the results of our prediction methodologies on all of the 40 sequences, for which an experimental answer was later available. Unlike the assessors' evaluations at CASP (which has recently appeared in the special issue (S5) of Proteins: Structure, Function, and Genetics), which focus on how well a particular group performs, we treat CASP as a test-bed for how well an individual method performs. Using the lessons learnt from CASP successes and failures, we suggest a unified approach that mixes and matches between the best predictions to produce the best results. The aim of this work is to illustrate when our prediction methods work, when they fail, and what this means in the context of building models for all proteins encoded by the genome of an organism at the present time.
Results and Discussion
We present a comprehensive analysis of all 40 blind predictions, for which an experimental answer was later available, that were made for CASP4 using a barrage of different but related techniques. We discuss what went right, what went wrong, what further improvements can be made to the methodologies, and the implications of these results for modelling the structure of all tractable proteins encoded by the genome of an organism.
What went right; what went wrong
The CASP4 results show that within each of the general structure prediction categories, some methods, including ours, are able to produce models with a fair amount of accuracy (quantified in the sections below). Further improvements are necessary to overcome the limits of current approaches.
Comparative modelling and fold recognition
Table 2 compares all the predictions we made for CASP4 using comparative modelling and fold recognition methods. The results are qualitatively assessed as being one of "excellent", "good", "useful", and "failure". In the comparative modelling category, we made 29 predictions for targets that had sequence identities ranging from 50% to 10% to the nearest related protein with known structure. For 23 of these proteins, we produced models ranging from 1.0 to 6.0 Å root mean square deviation (RMSD) for the Cα atoms between the model and the corresponding experimental structure for all or large parts of the protein, with model accuracies scaling fairly linearly with respect to sequence identity (i.e., the higher the sequence identity, the better the prediction). These 23 proteins ranged in accuracy from "excellent" to "useful". Figure 1 shows some examples of the comparative modelling predictions with different difficulties made at CASP4.
The comparative modelling and fold recognition targets are in Table 2 are sorted by the difficulty index. The percentage identities for alignments between several comparative modelling targets and their corresponding templates fall in the twilight zone or below (alignments with percentage identities <= 30%). In fact, such targets belong more in the category of fold recognition since it is clear that even a 20% identity alignment can easily result in a wrong fold assignment. (The percentage identity is used for illustration purposes only–BLAST e-values follow a similar trend but are most robust.)
Our comparative modelling methods produce excellent models when the percentage identity between the target and corresponding template sequence is high (usually within 2.0 Å Cα RMSD for > 30% identity). In several cases where the alignment falls into the twilight zone (20–30% sequence identity), models around 4.0 Å Cα RMSD are produced (T0122/trpa, T0112/dhso, T0125/spl8, T0121/malk).
In one case, T0092/yeco, the percentage identity between the target and template proteins in the alignment we used was 12%, but we predict 107 residues to within 6.0 Å Cα RMSD. However, not all cases where we assumed a homology relationship provided similar results, and the failures are indicated as "F" in Table 2.
While the graph-theory methods have been fairly successful at handling the interconnectedness problem to build non-conserved side chains and main chains , other major problems preventing the construction of accurate comparative models have to do with inaccurate alignments and using the template structure as a static model upon which to build variable main chains. In the former case, if a region of the alignment is incorrect but is assumed to be correct, then no amount of further model building will fix this error. In the latter case, the loop and side chain construction methods, even if interconnectedness is taken into account, are limited by the approximate nature of the template framework. In other words, alignment errors are irrecoverable. Even though 50–70% of the regions (of up to 15 residues) we thought would vary with respect to the parent homologue structure were predicted to within 3.0 Å Cα RMSD, this is mostly in cases where the approximate template is well-predicted (within 2.0 Å Cα RMSD).
Ab initio prediction
Table 2 compares all the predictions we made for CASP4 using our ab initio methods. We made eleven predictions for targets that had no detectable sequence relationships when we began the modelling process. We produced nine models with accuracies ranging from 4.0 to 6.0 Å Cα RMSD for 60–100 residue proteins (or large fragments of a protein). Figure 2 illustrates some of our more successful predictions.
At CASP4, we were consistently able to predict 60–80 residue consecutive fragments to within 6.0 Å, and, at times, to within 4.0 Å Cα RMSD. These results are much more consistent than at CASP3, and are also of better quality.
While these predictions are a significant improvement compared to the previous CASP results, we still have to make much progress before we can produce models rivalling that of experiment in accuracy. Given the range of RMSDs for the population of conformations sampled (i.e., "decoys") for each of the proteins (average range for the eleven predictions was 9.3 – 17.6 Å Cα RMSD for the entire protein; and 5.0 – 12.6 Å Cα RMSD when only the best fragments are considered), it is clear that devising representations that will allow us to explore protein conformational space such that near-native conformations are encountered is a major bottleneck. Our filter-based scoring function approach generally picks conformations from the lower end of the RMSD distribution (usually within the top 1%, and no worse than the 10%, of the conformations sampled), but further improvements can be made.
Caveats regarding the use of results from CASP experiments
Averaging over different methods and contexts
The results provided by the CASP organisers and assessors show how well a particular group did, but do not measure performance of individual methods in separate contexts. This makes it harder to determine which methods work well and places an inherent penalty on trying different non-conservative approaches. For example, even successful loop and side chain building methods will fail on comparative models based on incorrect alignments (in our case, we tried six different approaches in the three categories combined, the results for only two of which are listed in Figures 1 and 2). This problem has been alleviated to some degree by the CAFASP experiment , which provides a strict method-by-method automatic evaluation, but it requires that models be prepared by the means of an automated server in a relatively short time-frame. Ranking results by methods used (based on keywords provided when the model is submitted, which could be standardised), and considering subsets of the target relevant to particular methods, would help significantly in identifying the methods that work best.
Subjective quality of evaluations
Once a certain evaluation measure is chosen, then evaluating all models submitted by that measure is objective. However, particular methods appear to perform better depending on the choice of evaluation criteria used (for example, Cα RMSD over a contiguous set of residues, which we prefer, vs. Cα RMSD over non-contiguous residues). This illustrates the need for more than one measure, but even with that taken into account, there exists an inherent subjectivity in measurement, especially given the assessor's visual evaluation of the models during the CASP experiment (one of the authors of the paper, M.L., was an assessor at CASP2). The reason there is a problem is because the results are not entirely clear (i.e., the problem has not been solved). Until predictions with accuracies rivalling that of experiment are made, assessment of predictions must be done automatically using limited and stringent criteria, most relevant to biologists interested in function. Such a criteria could include, for example, how well the model picks out structurally similar proteins from the database of known structures, relative to the experimental result.
What can be done
Promising future areas
While the CASP experiments provide for an environment where rapid testing of ideas is possible in a rigourous manner, a lot of the development is ad hoc, guided by intuition, and not all parameter choices are explored thoroughly.
The CASP experiments also show that there is not one single algorithm that can "solve" the protein structure prediction problem. The most successful methods are those that combine and build upon the techniques developed by several researchers in the last thirty years (special issues of Proteins: Structure, Function, Genetics, 1995, 1997, 1999, and 2002). Generally the methods have incorporated different sampling techniques and a variety of scoring functions each of which aids prediction of structure only to a limited degree when used individually, but are producing models useful for further biological study when combined together in a coherent manner.
To provide a guidance for future work, we analysed some of the more promising paths that we discovered to assess their viability in improving our methods and making better predictions, focusing on four major areas: alignment, refinement, sampling, and selection. An analysis of the results generated by our methods at the next CASP (evaluated in December 2002) will provide a measure of the effectiveness of these improvements.
Comparative modelling and fold recognition: Alignment and template selection using all-atom scoring functions
A major reason for alignment methods failing at CASP has to do with using sequence information only and not incorporating structural information. For example, while modelling T24/ubc9, sequence alignments generated by several methods have an alignment error relative to the structural alignment . The sequence identity/similarity scores would have been lower with the new alignment since the number of identical residues decreases by six in a region of fourteen residues. This phenomenon has been observed time and again at CASP, illustrated in Figure 3 by three examples, including T24/ucb9. We were later able to readily distinguish between the correct and incorrect alignments when an all-atom scoring function was applied to the models constructed using both alignments, and justify the changes by detailed environment analysis. The score for the models based on the correct alignments were better by ~10% on average relative to the model with the original alignment. This would indicate that a sequence alignment algorithm that incorporates structural information in a rigourous manner is useful and necessary to handle the alignment problem.
Historically, in comparative modelling, the template with the highest sequence identity or similarity to the target sequence being modelled has been used for further analysis. However, a comparison of members of a family with known structures shows that sequence only measures do not correlate absolutely with the structural similarity  even in cases where the evolutionary relationships are obvious.
We thus devised an experiment where we constructed models for protein families with large numbers of known structures (specifically the globin and the immunoglobulin families). We then conducted an all-against-all homology modelling exercise where every member of the family was modelled on every other template (resulting in 29 and 60 models for each member of the globin and immunoglobulin families respectively). We compared the performance of the all-atom scoring function to two sequence only metrics. The results for the globin family are given in Figure 4. On average, using the all-atom function improves model quality by 0.8 Å Cα RMSD compared to only using sequence identity. The theoretical best improvement that could have been achieved on average is 0.9 Å Cα RMSD. Similar improvements are observed for the immunoglobulin family.
Taken together with previously published results [32, 49], these results strongly indicate that the all-atom scoring function is a powerful method to handle the alignment problem, the template selection problem, the construction of side chains and main chains, and potentially helpful in refining models when continuous forms of the function are used.
Ab initio prediction: Sampling conformational space
At CASP4, we mixed and matched different move sets and search methods for sampling protein conformational space. Since we did not have the time to test the performance of each move set or search method, we assumed they would work equally well on average and combined them sequentially which generally resulted in improvements.
Table 4 shows the average results of different combinations of move sets and search methods for a set of six proteins (PDB codes: 1ctf, 1e68, 1eh2, 1nkl, 1pgb, 1sro; four of these were CASP targets). The results shown are for 10,000 trajectories with different starting random seeds. While some of the combinations do not necessarily enhance the simple approach of using only 3-residue fragments with a straight-forward monte carlo procedure, the combination of using fragments and the 14-state model for making moves, with MC and GA search techniques for the sampling, shows a significant improvement, which we hope to demonstrate at CASP5 by further extending the preliminary studies described here. Since these combinations were tried with equal weighting, further improvement may be obtained by parameterising how the different move sets and search techniques are applied depending on the trajectory landscape.
Ab initio prediction: Selecting native-like conformations
Even though our all-atom function is readily able to distinguish native-like conformations in certain scenarios, it is not adequate for large sets of decoys where the closest conformation generated is represented at the topological level (≈ 6.0 Å Cα RMSD relative to the experimental result). Using the all-atom function alone to select native-like conformations is not likely to suffice when it is also used in the actual minimisation/search process, since all conformations generated in such searches represent local minima of this function. Thus, our method has incorporates multiple functions and uses hierarchical filtering to reduce the number of conformations from a large sample to a tiny fraction to enhance the signal and eliminate false positives.
At CASP4, we used our expertise to manually devise a single hierarchical filtering scheme where we successively eliminated 10% of the conformations with each filter until we were left with one conformation. In the experiment in Table 5, we compare the average performance of each of the individual filters to our final hierarchical combination when reducing the 10,000 conformations generated for each protein by our search method (corresponding to the last entry in Table 4) to 1000 conformations. The hierarchical combination first reduces the 10,000 conformations to 8000 by applying the density function, which is then reduced to 6000 by applying the hydrophobic compactness function, which is then reduced to 4000, 3000, 2000, and 1000 in the same order as presented in Table 5.
Table 5 shows that particularly promising filters include the use of density-based scoring functions, hydrophobic compactness, all-atom pairwise preferences and match of the final conformation to the predicted secondary structure. Physics-based functions based on electrostatics and van der Waals interactions do not discriminate well on their own, and only do so when an explicit solvation term is added to the functions.
Table 5 also shows that even though some of the individual functions perform well, the combination of all the functions applied in a hierarchical manner performs the best. As mentioned earlier, this combination was developed through intuition under pressure from the CASP experiment (though here the goal was to reduce the total number of conformations to five). This suggests that there exists more optimal (linear and non-linear) combinations of these functions.
Table 6 lists the times taken for the computational tasks outlined in this paper. Times are given per 1000 MHz Pentium III processor and for a cluster of 64 such processors when the algorithm can run in parallel. For CASP4, predictions were made with computing power 1/4th of the capability shown.
Application of structure prediction methods to whole genomes
The qualitative assessment of our methods, considered independently of the difficulty of the prediction, ranks 32/40 models as useful, good, or excellent. Similar results are likely to be observed when these methods are applied to large numbers of sequences if we assume that the sample of 40 proteins roughly reflects the distribution of proteins seen in a genome. In practice, it is likely that we will encounter more homologous proteins in a genome since experimentalists are not as likely to solve a structure for which there clearly exists a homolog.
This is a long way from our predictions at CASP1 , and our initial implementations of these methodologies [12, 24]. Yet there is much room for further improvement. Besides improving the existing methodologies, and developing new ones, we can also integrate other existing algorithms such that consensus predictions can be used to assign confidence levels, as well as having multiple choices for an outcome that can be tested experimentally.
Analyses of small genomes show that about 30–40% of the proteins within the genome can be modelled by comparative modelling and fold recognition methods [27, 50, 52]. An additional 20–30% of the sequences are (or contain) small domains with simple secondary structures that are viable candidates for ab initio structure prediction . The remaining proteins are usually not amenable to structure prediction and sometimes even structure determination (a significant fraction of the latter are membrane proteins).
It is thus possible to construct a "genome prediction engine" using the computational resources available where we can take the protein sequences encoded by an organism's genome and attempt to predict their structures, and use the modelled structures to predict functions. The goal of this endeavour is to improve existing methods and develop new ones to perform various facets of the genome/proteome modelling task in an automated fashion. To this end, our predictions for the next CASP are almost entirely focused on the fully-automated (CAFASP) aspect via the use of a prediction server http://protinfo.compbio.washington.edu
Using predicted structures to annotate function
The reason for obtaining structures for proteins encoded by a genome is that they can be used to understand function and further our knowledge about the organism's biology. Even though structure prediction methods need further development, it is possible to produce models where functional hypotheses can be tested in a rational manner (for example, with mutagenesis experiments) through detailed analysis . Additionally, structure comparisons can be used to detect functional relationships that cannot be detected by sequence information alone , and micro-environment analyses that parse models for particular three-dimensional motifs  can be used to discern molecular function. Both of these structure-based approaches, used complementarily in conjunction with sequence-only motif-finding approaches [56–58] and experimental data, will enable to us better assign function to all or large parts of a proteome.
Why is protein/proteome modelling important?
Even given the ongoing structural genomics projects, the continually increasing amount of DNA and protein sequence data from genome projects makes it infeasible for NMR and x-ray crystallography techniques to rapidly provide information about the 3D structures of the sequences determined . Thus there is an urgent need for reliably predicting structure from amino acid sequence.
Proteins in a cell do not work in isolation of one another. Thus to understand the function of multi-protein complexes, or whole proteomes, from a structural viewpoint, it is necessary to have a model for many proteins encoded by the genome of an organism. The CASP results indicate that structure prediction methods have matured to a point where they can be applied on a genome-wide scale, and that these structures can be used with novel but straight-forward approaches to annotate and understand function [54, 55, 60, 61]. The resulting models and annotations, when combined with other genomic/proteomic data, including that from gene expression arrays , genome-wide two-hybrid experiments , and other proteomics studies , will provide us with a dynamic picture of organismal structure, function, and evolution .
In this section, we describe the procedures we used for making predictions at CASP4. The techniques described are divided based on the major structure prediction categories, but methods developed for application in one category are useful in the other.
Comparative modelling and fold recognition
The same procedure was used for comparative modelling and fold recognition targets. Protein sequences determined to be evolutionarily related to sequences with known structure were modelled using comparative modelling techniques developed by us. We used a combination of methodologies grouped together as shown in Figure 5. Our primary focus was on improving alignment and template selection techniques, and developing methods for moving an approximate conformation closer to the native structure. Additionally, the lessons we learnt from application of our ab initio methodologies were incorporated to better construct non-conserved side chains and main chains.
Template selection and alignment
Target sequences related to proteins that have conformations determined by experiment (X-ray crystallography or NMR) were candidates for comparative modelling. If the sequence relationship between the template and target proteins was unambiguous (usually when the sequence identity is > 40%), or if there was only one protein with known structure in the family, the template structure was used to construct the sole initial model. If there were many possible template structures, models were constructed using all available templates.
PSIBLAST alignments and other publicly available servers such as GenTHREADER  and SAM-T99 , available as part of the CAFASP meta-server , were also used to generate a variety of choices for alignments. These alternate alignments were used to construct initial models. Thus, for a given protein in a family with at least one known representative structure, there could be many template and alignment choices for constructing the initial models.
Constructing initial models
Following the sequence alignment, an initial model was generated for each template structure and corresponding alignment by copying atomic coordinates for the main chain (excluding any insertions/loops) and for the side chains of identical residues in the target and template proteins. Residues that differed in side chain type were constructed using a minimum perturbation (MP) technique . The MP method changes a given amino acid to the target amino acid preserving the values of equivalent torsion angles between the two side chains, where available. The other angles were constructed for each residue type using internally developed library based on the most frequently observed χ values in a database of known structures .
Manual inspection to improve alignments
An all-against-all structure comparison between all the initial models was used to produce a multiple sequence alignment based on structural similarity for a given family. This alignment was used in conjunction with sequence information and interactive graphics to create new sequence alignments.
Constructing variable side chains and main chains
Multiple side chain conformations for residue positions that differ in type between the template and target proteins were generated by exploring all the possibilities in a rotamer library . The most probable conformations based on the interactions of a given conformation with the local main chain were selected using an all-atom distance dependent conditional probability discriminatory function .
A set of possible conformations were generated for main chain regions (loops) considered to vary in the target with respect to the template structures, including insertions and deletions. Main chain sampling was performed using an exhaustive enumeration technique based on 14 discrete torsion angle states . For longer main chain regions (> 15 residues), fragments from a database of protein structures are used to generate the torsion angle values. Developments in our ab initio sampling protocol were incorporated into our loop sampling technique.
In CASP experiments, main chain regions and side chains selected for sampling were determined visually using interactive computer graphics. We partially automated this procedure by developing programs to identify side chains with implausible packing, clashes, and unfavorable electrostatic interactions with other side chains and/or main chain.
All-atom conditional probability scoring function
The all-atom scoring function is the core of many aspects of this project where identification of native-like conformations is required. The function calculates the probability of a conformation being native-like given a set of inter-atomic distances . The conditional probabilities are compiled by counting frequencies of distances between pairs of atom types in a database of protein structures. All non-hydrogen atoms are considered, and a residue-specific description of the atoms is used, i.e., the Cα of an alanine is different from the Cα of a giycine. This results in a total of 167 atom types. The distances observed are divided into 1.0 Å bins ranging from 3.0 Å to 20.0 Å. Contacts between atom types in the 0–3 Å range are placed in a separate bin, resulting in a total of 18 distance bins. Distances between atoms within a single residue are not included in the counts.
We then compile tables of scores proportional to the negative log conditional probability that one is observing a native conformation given an interatomic distance for all possible pairs of the 167 atom types for the 18 distance ranges. Given a set of distances in a conformation, the probability that the conformation represents a "correct" fold is evaluated by summing the scores for all distances and the corresponding atom pairs.
Using graph theory to generate consistent conformations
We use a graph-theoretic approach to assemble the sampled side chain and main chain conformations together in a consistent and optimal manner: Each possible conformation of a residue is represented using the notion of a node in a graph. Each node is given a weight based on the degree of the interaction between its side chain atoms and the local main chain atoms. The weight is computed using the all-atom scoring function . Edges are then drawn between pairs of residues/nodes that are consistent with each other (i.e., clash-free and satisfying geometrical constraints). The edges are also weighted according to the probability of the interaction between atoms in the two residues. Once the entire graph is constructed, all the maximal sets of completely connected nodes (cliques) are found using a clique-finding algorithm . The cliques with the best total weights represent the optimal combinations of mixing and matching among the various possibilities, taking the respective environments into account . The clique-finding approach for generating conformations is fast, since it pre-calculates all the scores. In its present implementation, it can sample up to 1011 conformations in a 24-hour period on a 1000 MHz Intel processor.
Selecting the most native-like conformations
All models produced are refined using the Energy Calculation and Dynamics (ENCAD) package . For a given protein sequence, there could be more than one all-atom model produced. For such cases, all models were ranked using the all-atom pairwise scoring function  and the best scoring models are considered to be the most native-like ones.
Ab initio prediction
Target sequences without known homologues or analogues that were small in size and/or predicted to have largely helical content were modelled by our ab initio protocol. Such sequences can be subsequences of larger proteins, in which case they most likely represent domain boundaries .
Our general paradigm for predicting structure involves sampling the conformational space (or generating "decoys") such that native-like conformations are observed, and then selecting them using a hierarchical filtering technique with many different scoring functions (Figure 6). The two parts to our method are designed to be completely automated and readily extendable to application for hundreds or thousands of sequences. Generally, we explore combinations of different representations/move sets with two search methods for exploring protein conformational space, and combinations of a variety of scoring function "filters" to identify biologically relevant conformations.
Sampling protein conformational space
We initially start with an all-atom conformation where the torsion values for residues predicted to be in helix/sheet by PSIPRED secondary structure prediction  are set to idealised values . The remaining φ/ψ values are set in an extended conformation. Side chain conformations are predicted by simply using the most frequently observed rotamer in a database of protein structures . New conformations are generated by perturbing the existing conformation at an arbitrary residue by one of two methods: (i) the torsion values for three residues with identical sequence from a known structure are used to modify the current conformation, similar in spirit to that of Baker and colleagues ; (ii) one of possible 14 torsion (φ/ψ) values derived based on the most frequently occurring torsion values for a given residue in a database of known structures. The move sets were combined sequentially (i.e., where a certain number of iterations consisted of copying torsion values for 3-residue fragments and the next few iterations would use torsion values from the 14-state φ/ψ model).
The scoring function for minimisation is primarily a combination of the all-atom function, a hydrophobic compactness function, and a bad contacts function . The primary search technique we used was a Metropolis Monte Carlo (MC) procedure where conformations are accepted or rejected based on the Boltzmann's equation . Each trajectory was allowed 50,000 iterations, starting with a high temperature such that 99% of the moves were accepted for the first 1000 steps and "cooled" linearly until only 1% of the moves were accepted for the last 100 steps.
At particular points in the trajectory (every 1000 steps), a fragment from another trajectory was copied at random, similar in spirit to genetic algorithms (GA) strategies [42, 43]. The standard Metropolis criterion would then apply: i.e., if the selected fragment enhanced the score of the conformation relative to the previous conformation, it would be accepted. If not, its probability of acceptance would be determined by the difference in score between the current conformation and the previous conformation. The probability is calculated by the Boltzmann-like equation where ΔE is the difference in scores and kT, representing the product of Boltzmann's constant and temperature, is set to a value calculated using the standard deviation of the scores of the first 1000 steps in a given trajectory.
Selecting biologically relevant conformations
The conformations generated were minimised using ENCAD  and scored using a combination of scoring functions that hierarchically reduces the total number of conformations produced to one final conformation. The scoring functions used for the final filtering include the all-atom function , hydrophobic compactness , a simple residue-residue contact function , a density-scoring function that is based on the distance of a conformation to all its relatives in the conformation pool, contact order , a secondary structure based scoring function that evaluates the match between the predicted structure and the secondary structure of a final energy-minimised conformation, and standard physics-based electrostatics and Van der Waals terms .
Internal testing and comparison of models to the experimental result
We initially ran our algorithms on test sets consisting of 10–20 proteins. To minimise bias of a particular algorithm to a fixed test set, new proteins were added to the test sets regularly. In all cases where a three-dimensional model must be compared to an experimental structure, we use the root mean square deviation (RMSD) between corresponding atoms of the prediction and the experimental answer (usually calculated using the Cα atoms).
Availability of software and decoys
The ensembles of structures that were generated and much of the software used to generate them are available at http://compbio.washington.edu
Moult J, Hubbard T, Fidelis K, Pedersen J: Critical Assessment of Methods of Protein Structure Prediction (CASP): Round III. Proteins 1999, S3: 2–6. Publisher Full Text 10.1002/(SICI)1097-0134(1999)37:3+<2::AID-PROT2>3.3.CO;2-U
Doolittle R: Similar amino acid sequences: chance or common ancestry? Science 1981, 214: 149–159.
Greer J: Comparative modeling methods: application to the family of the mammalian serine proteases. Proteins 1990, 7: 317–334.
Sander C, Schneider R: Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 1991, 9: 56–68.
Murzin A, Bateman A: Distant homology recognition using structural classification of proteins. Proteins 1997, 29S: 105–112. 10.1002/(SICI)1097-0134(1997)1+<105::AID-PROT14>3.3.CO;2-1
Bowie J, Lüthy R, Eisenberg D: Method to identify protein sequences that fold into a known three-dimensional structure. Science 1991, 253: 164–170.
Jones D, Taylor W, Thornton J: A new approach to protein fold recognition. Nature 1992, 358: 86–89. 10.1038/358086a0
Flöckner H, Domingues F, Sippl M: Protein folds from pair interactions: a blind test in fold recognition. Proteins 1997, S1: 129–133. 10.1002/(SICI)1097-0134(1997)1+<129::AID-PROT17>3.3.CO;2-F
Lee J, Liwo A, Ripoll D, Pillardy J, Scheraga J: Calculation of protein conformation by global optimization of a potential energy function. Proteins 1999, S3: 204–208. 10.1002/(SICI)1097-0134(1999)37:3+<204::AID-PROT26>3.3.CO;2-6
Ortiz A, Kolinkski A, Rotkiewicz P, Ilkowski B, Skolnick J: Ab initio folding of proteins using restraints derived from evolutionary information. Proteins 1999, S3: 177–185. 10.1002/(SICI)1097-0134(1999)37:3+<177::AID-PROT22>3.3.CO;2-5
Osguthorpe D: Improved ab initio predictions with a simplified, flexible geometry model. Proteins 1999, S3: 186–193. 10.1002/(SICI)1097-0134(1999)37:3+<186::AID-PROT23>3.3.CO;2-4
Samudrala R, Xia Y, Huang E, Levitt M: Ab initio protein structure prediction using a combined hierarchical approach. Proteins: Struct Fund Genet 1999, S3: 194–198. 10.1002/(SICI)1097-0134(1999)37:3+<194::AID-PROT24>3.3.CO;2-6
Simons K, Bonneau R, Ruczinski I, Baker D: Ab initio structure prediction of CASPIII targets using ROSETTA. Proteins 1999, S3: 171–176. 10.1002/(SICI)1097-0134(1999)37:3+<171::AID-PROT21>3.3.CO;2-Q
Mosimann S, Meleshko R, James M: A critical assessment of comparative molecular modeling of tertiary structures in proteins. Proteins 1995, 23: 301–317.
Lemer C.M.-R, Rooman M, Wodak S: Protein structure prediction by threading methods: evaluation of current techniques. Proteins: Struct Funct Genet 1995, 23: 337–355.
Defay T, Cohen F: Evaluation of current techniques for ab initio protein structure prediction. Proteins: Struct Funct Genet 1995, 23: 431–445.
Martin AC, MacArthur M, Thornton J: Assessment of comparative modelling in CASP2. Proteins 1997, S1: 14–28. 10.1002/(SICI)1097-0134(1997)1+<14::AID-PROT4>3.3.CO;2-F
Levitt M: Competitive assessment of protein fold recognition and threading accuracy. Proteins 1997, 92–104. 10.1002/(SICI)1097-0134(1997)1+<92::AID-PROT13>3.3.CO;2-1
Lesk A: CASP2: Report on ab initio predictions. Proteins 1997, 151–166. 10.1002/(SICI)1097-0134(1997)1+<151::AID-PROT20>3.3.CO;2-J
Jones T, Kleywegt G: CASP3 comparative modeling evaluation. Proteins 1999, S3: 30–46. 10.1002/(SICI)1097-0134(1999)37:3+<30::AID-PROT6>3.3.CO;2-J
Murzin A: Structure classificiation-based assessment of CASP3 predictions for the fold recognition targets. Proteins 1999, S3: 88–103. 10.1002/(SICI)1097-0134(1999)37:3+<88::AID-PROT13>3.3.CO;2-V
Orengo CA, Bray J, Hubbard T, LoConte L, Sillitoe J: Analysis and assessment of ab initio three-dimensional prediction, secondary structure, and contacts prediction. Proteins 1999, S3: 149–170. 10.1002/(SICI)1097-0134(1999)37:3+<149::AID-PROT20>3.3.CO;2-8
Samudrala R, Pedersen J, Zhou H, Luo R, Fidelis K, Moult J: Confronting the problem of interconnected structural changes in the comparative modelling of proteins. Proteins: Struct Fund Genet 1995, 23: 327–336.
Samudrala R, Moult J: Handling context-sensitivity in protein structures using graph theory: bona fide prediction. Proteins: Struct Fund Genet 1997, 29S: 43–49. 10.1002/(SICI)1097-0134(1997)1+<43::AID-PROT7>3.3.CO;2-Z
Koehl P, Levitt M: A brighter future for protein structure prediction. Nat Struct Biol 1999, 6: 108–111. 10.1038/5794
Murzin A, Hubbard T: Prediction targets of CASP4. Proteins: Struct Fund Genet 2001, S5: 8–12. 10.1002/prot.10055
Jones D: GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequence. J Mol Biol 1999, 287: 797–815. 10.1006/jmbi.1999.2583
Karplus K, Barrett C, Cline M, Diekhans M, Grate L, Hughey R: Predicting protein structure using only sequence information. Proteins: Struct Fund Genet 1999, S3: 121–125. Publisher Full Text 10.1002/(SICI)1097-0134(1999)37:3+<121::AID-PROT16>3.3.CO;2-H
Bujnicki J, Elofsson A, Fischer D, Rychlewski L: Structure prediction meta server. Bioinformatics 2001, 17: 750–751. 10.1093/bioinformatics/17.8.750
Samudrala R, Huang E, Koehl P, Levitt M: Side chain construction on near-native main chains for ab initio protein structure prediction. Protein Eng 2000, 7: 453–457. 10.1093/protein/13.7.453
Samudrala R, Moult J: Determinants of side chain conformational preferences in protein structures. Protein Eng 1998, 11: 991–997. 10.1093/protein/11.11.991
Samudrala R, Moult J: An all-atom distance dependent conditional probability discriminatory function for protein structure prediction. J Mol Biol 1998, 275: 895–916. 10.1006/jmbi.1997.1479
Park B, Levitt M: The complexity and accuracy of discrete state models of protein structure. J Mol Biol 1995, 249: 493–507. 10.1006/jmbi.1995.0311
Bron C, Kerbosch J: Algorithm 457: Finding all cliques of an undirected graph. Comm ACM 1973, 16: 575–577. 10.1145/362342.362367
Samudrala R, Moult J: A graph-theoretic algorithm for comparative modelling of protein structure. J Mol Biol 1998, 279: 287–302. 10.1006/jmbi.1998.1689
Levitt M, Hirshberg M, Sharon R, Daggett V: Potential energy function and parameters for simulations of the molecular dynamics of proteins and nucleic acids in solution. Comp Phys Comm 1995, 91: 215–231. 10.1016/0010-4655(95)00049-L
Gouzy J, Corpet F, Kahn D: Whole genome protein domain analysis using a new method for domain clustering. Comp and Chem 1999, 23: 333–340. 10.1016/S0097-8485(99)00011-X
Jones D: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999, 292: 195–202. 10.1006/jmbi.1999.3091
Simons K, Kooperberg C, Huang E, Baker D: Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and bayesian scoring functions. J Mol Biol 1997, 268: 209–225. 10.1006/jmbi.1997.0959
Samudrala R, Xia Y, Levitt M, Huang E: A combined approach for ab initio construction of low resolution protein tertiary structures from sequence. In: Proceedings of the Pacific Symposium on Biocomputing (Edited by: Altman R, Dunker A, Hunter L, Klein T, Lauderdale K). World Scientific Press 1999, 505–516.
Metropolis N, Rosenbluth A, Rosenbluth M, Teller A, Teller E: Equation of state calculations by fast computing machine. J Comput Phys 1953, 21: 1087–1092.
Pedersen JT, Moult J: Folding simulation with genetic algorithms and a detailed molecular description. J Mol Biol 1997, 269: 240–259. 10.1006/jmbi.1997.1010
Dandekar T, Argos P: Applying experimental data to protein fold prediction with the genetic algorithm. Protein Eng 1997, 10: 877–893. 10.1093/protein/10.8.877
Huang E, Subbiah S, Levitt M: Recognising native folds by the arrangement of hydrophobic and polar residues. J Mol Biol 1995, 252: 709–720. 10.1006/jmbi.1995.0529
Plaxco K, Simons K, Baker D: Contact order, transition state placement, and the refolding rates of single domain proteins. J Mol Biol 1998, 277: 985–994. 10.1006/jmbi.1998.1645
Brooks B, Bruccoleri R, Olafson B, States D, Swaminathan S, Karplus M: CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. J Comp Chem 1983, 4: 187–217.
Fischer D, Barret C, Bryson K, Elofsson A, Godzik A, Jones D, Karplus K, Kelley L, MacCallum R, Pawowski K, Rost B, Rychlewski L, Sternberg M: CAFASP-1: critical assessment of fully automated structure prediction methods. Proteins: Struct Fund Genet 1999, S3: 209–217. Publisher Full Text 10.1002/(SICI)1097-0134(1999)37:3+<209::AID-PROT27>3.3.CO;2-P
Levitt M, Gerstein M: A Unified Statistical Framework for Sequence Comparison and Structure Comparison. Proc Natl Acad Sci USA 1998, 95: 5913–5920. 10.1073/pnas.95.11.5913
Samudrala R, Levitt M: Decoys 'R' Us: A A database of incorrect protein conformations to improve protein structure prediction. Protein Sci 2000, 9: 1399–1401.
Sanchez R, Sali A: Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc Natl Acad Sci USA 1998, 95: 13597–13602. 10.1073/pnas.95.23.13597
Martin-Renom M, Stuart A, Fiser A, Sanchez R, Melo F, Sali A: Comparative protein structure modeling of genes and genomes. Annu Rev Biophy Biomol Struct 2000, 29: 291–325. 10.1146/annurev.biophys.29.1.291
Brenner S, Levitt M: Expectations from structural genomics. Protein Sci 2000, 9: 197–200.
Bonneau R, Baker D: Ab initio protein structure prediction: Progress and prospects. Annu Rev Biophy Biomol Struct 2001, 30: 173–189. 10.1146/annurev.biophys.30.1.173
Samudrala R, Xia Y, Levitt M, Cotton N, Huang E, Davis R: Probing structure-function relationships of the dna polymerase alpha-associated zinc-finger protein using computational approaches. In: Proceedings of the Pacific Symposium on Biocomputing (Edited by: Altman R, Dunker A, Hunter L, Klein T, Lauderdale K). World Scientific Press 2000, 179–189.
Wei L, Huang E, Altman R: Are predicted structures good enough to preserve functional sites? Structure 1999, 7: 643–650. 10.1016/S0969-2126(99)80085-9
Hofmann K, Bucher P, Falquet L, Bairoch A: The PROSITE database, its status in 1999. Nucleic Acids Res 1999, 27: 215–219. 10.1093/nar/27.1.215
Attwood T, Croning M, Flower D, Lewis A, Mabey J, Scordis P, Selley J, Wright W: PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res 2000, 28: 225–227. 10.1093/nar/28.1.225
Henikoff J, Green E, Pietrokovski S, Henikoff S: Increased coverage of protein families with the blocks database servers. Nucleic Acids Res 2000, 28: 228–230. 10.1093/nar/28.1.228
May A, Johnson M, Rufino S, Wako H, Zhu Z, Sowdhamini R, Srinivasan N, Rodionov M, Blundell T: The recognition of protein structure and function from sequence: adding value to genome data. Phil Trans Roy Soc Lond 1994, 344: 373–381.
Van Loy C, Sokurenko E, Samudrala R, Moseley S: Identification of a DAF binding domain in the Dr adhesin. Mol Microbiol (to appear) 2002.
Baker D, Sali A: Protein structure prediction and structural genomics. Science 2001, 294: 93–96. 10.1126/science.1065659
Lander E: Array of hope. Nat Genet 1999, 21: 3. 10.1038/4427
Schwikowski B, Uetz P, Fields S: A network of protein-protein interactions in yeast. Nature Biotechnol 2000, 18: 1242–1243. 10.1038/82360
Gygi S, Rist B, Gerber S, Turecek F, Gelb M, Aebersold R: Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nature Biotechnol 1999, 17: 994–999. 10.1038/13690
Ideker T, Thorsson V, Ranish J, Christmas R, Buhler J, Eng J, Bumgarner R, Goodlett D, Aebersold R, Hood L: Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 2001, 292: 929–934. 10.1126/science.292.5518.929
Satow Y, Cohen G, Padlan E, Davies D: Phosphocholine binding immunoglobulin Fab McPC603. An X-ray diffraction study at 2.7 Å. J Mol Biol 1986, 190: 593–604.
Shindyalov I, Bourne P: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 1998, 11: 739–747. 10.1093/protein/11.9.739
This work was supported in part by a Burroughs Wellcome Fund Fellowship from the Program in Mathematics and Molecular Biology to Ram Samudrala and NIH grant GM-41455 Michael Levitt. We thank Patrice Koehl for his efficient FORTRAN code used for constructing protein structures and calculating RMSDs. We also thank past and present members of the Levitt group for their intellectual guidance and support. Most importantly, we thank all the experimentalists who provided targets for CASP, and the assessors and the organisers for conducting a stimulating experiment.
RS performed the algorithm development, carried out the computations, evaluated the results, and drafted the manuscript. ML helped with the algorithm development and evaluation of the results produced, and provided intellectual guidance and mentorship. All authors read and approved the final manuscript.