CRYSTALP2: sequence-based protein crystallization propensity prediction
© Kurgan et al; licensee BioMed Central Ltd. 2009
Received: 5 January 2009
Accepted: 31 July 2009
Published: 31 July 2009
Current protocols yield crystals for <30% of known proteins, indicating that automatically identifying crystallizable proteins may improve high-throughput structural genomics efforts. We introduce CRYSTALP2, a kernel-based method that predicts the propensity of a given protein sequence to produce diffraction-quality crystals. This method utilizes the composition and collocation of amino acids, isoelectric point, and hydrophobicity, as estimated from the primary sequence, to generate predictions. CRYSTALP2 extends its predecessor, CRYSTALP, by enabling predictions for sequences of unrestricted size and provides improved prediction quality.
A significant majority of the collocations used by CRYSTALP2 include residues with high conformational entropy, or low entropy and high potential to mediate crystal contacts; notably, such residues are utilized by surface entropy reduction methods. We show that the collocations provide complementary information to the hydrophobicity and isoelectric point. Tests on four datasets show that CRYSTALP2 outperforms several existing sequence-based predictors (CRYSTALP, OB-score, and SECRET). CRYSTALP2's accuracy, MCC, and AROC range between 69.3 and 77.5%, 0.39 and 0.55, and 0.72 and 0.79, respectively. Our predictions are similar in quality and are complementary to the predictions of the most recent ParCrys and XtalPred methods. Our results also suggest that, as work in protein crystallization continues (thereby enlarging the population of proteins with known crystallization propensities), the prediction quality of the CRYSTALP2 method should increase. The prediction model and the datasets used in this contribution can be downloaded from http://biomine.ece.ualberta.ca/CRYSTALP2/CRYSTALP2.html.
CRYSTALP2 provides relatively accurate crystallization propensity predictions for a given protein chain that either outperform or complement the existing approaches. The proposed method can be used to support current efforts towards improving the success rate in obtaining diffraction-quality crystals.
Structural genomics is a world-wide initiative aimed at producing a comprehensive mapping of the protein structure space. The resulting knowledge of the tertiary structure of proteins will be vitally important for understanding and manipulating the biochemical and cellular functions of a given protein. This is an important step in rational drug design and provides valuable insights into important diseases. There are several different ways to obtain the structure, including X-ray diffraction, electron microscopy, and NMR. Although a majority of protein structures are obtained using the first method, the two latter approaches play a strong complementary role for some protein types, such as membrane proteins [4–6]. One of the main challenges the structural genomics initiative faces is that only about 2–10% of protein targets pursued yield high-resolution protein structures. Several strategies have been proposed to improve the success rate, including obtaining one representative structure per protein family and working with multiple orthologues [8–11]. One of the most important bottlenecks in acquiring the structures is obtaining diffraction-quality crystals [12–14]. At the same time, crystallization is characterized by a significant rate of attrition and is among the most complex and least understood problems in structural biology. Current protocols yield crystals for approximately 30% of the input proteins and well-diffracting crystals for an even smaller fraction. This motivated the development of models that can be used to either support or directly predict protein crystallization. For instance, the isoelectric point (pI) calculated from a primary sequence was used in a method that suggests optimal pH ranges for crystallization screening [16, 17]. Several other investigations suggest that features derived from protein sequences can be used for predicting crystallization propensity [18, 19].
To this end, a few in-silico methods that predict crystallization propensity using the primary sequence as the input have recently been developed. They include SECRET, OB-Score, CRYSTALP, and most recently ParCrys. SECRET and CRYSTALP accept only sequences between 46 and 200 amino acids (AAs) in length. Although OB-Score does not impose a limit on sequence size, it considers only two predictive features (pI and hydrophobicity), which limits the quality of its predictions. The ParCrys method extends OB-Score by using a kernel-based classification algorithm and adding the composition vector of several amino acids (including Ser, Cys, Gly, Phe, Tyr, and Met) to the set of predictive features. All of these methods are built using black-box classification models, which are inductively learned from a set of protein chains, each annotated as either crystallizable or noncrystallizable. By contrast, the XtalPred method is a white-box approach that combines probabilities of successful crystallization calculated from several protein features. This method, which was developed based on experiences at the Joint Center for Structural Genomics, strives to mimic the work performed by structural biologists. XtalPred compares nine biochemical and biophysical features of an input protein with probability distributions estimated from data from the TargetDB database (http://targetdb.pdb.org/). These features include protein length, molecular mass, Gravy and instability indices, extinction coefficient, isoelectric point, content of Cys, Met, Trp, Tyr, and Phe residues, insertions in the alignment compared to homologues in a non-redundant database of protein sequences, predicted secondary structure, predicted disordered, low-complexity and coiled-coil regions, and predicted transmembrane helices and signal peptides.
The individual probabilities are combined into a single crystallization score, which is used to assign one of five crystallization classes: optimal, suboptimal, average, difficult, and very difficult. XtalPred provides a good benchmark for comparison since it uses a sophisticated sequence analysis (including several predictions) and models the routine "manual" work of structural biologists.
In the current article, we extend the CRYSTALP method to improve the quality of the predictions and to remove the sequence size restriction. When compared with CRYSTALP, the proposed CRYSTALP2 method uses new predictive features that are based on the collocation of amino acids in the sequence [22, 26–29], includes information about pI and hydrophobicity, and applies a kernel-based classifier. Our goal is to provide a relatively simple method, i.e., we do not use sophisticated sequence analysis. We expect that our method will thus be complementary to current methods including XtalPred and ParCrys. We also note that many studies have shown that sequence-based prediction approaches, which may address a variety of structural and functional properties of proteins, provide useful information and insights for both basic research and drug design and hence are widely welcomed by the scientific community [30–34].
Our methodology consists of two steps: (1) the protein sequence is converted into a fixed size feature vector, and (2) the feature values are entered into the classification model to predict the protein class (crystallizable/noncrystallizable). We followed the same design procedure as in [20, 22] and our evaluation follows [20, 22, 23].
The design of the proposed method is based on a dataset of 418 proteins (hereafter D418) that includes 192 noncrystallizable and 226 crystallizable chains, which was introduced in . Following the approach taken to design and test SECRET and CRYSTALP, the design is based on tenfold cross-validation of the D418 dataset. We compare our out-of-sample predictions on D418 with SECRET and CRYSTALP. We also employ three datasets that were recently introduced in and a new test dataset introduced in this contribution to compare CRYSTALP2 with CRYSTALP, SECRET, OB-Score, ParCrys and XtalPred. These four datasets are drawn from the TargetDB and PepcDB (http://pepcdb.pdb.org/) databases by applying procedures established in . We use the FEAT dataset (composed of 1456 sequences, 728 crystallizable and 728 non-crystallizable) as the training dataset, while the TEST and TEST-RL datasets, composed of 144 (72 crystallizable and 72 non-crystallizable) and 86 (43 crystallizable and 43 non-crystallizable) sequences, respectively, are used as out-of-sample test sets. The sequences in the test datasets were made nonredundant (using CD-HIT in the case of D418 and using AMPS in the case of TEST and TEST-RL) to avoid any bias towards similar proteins and to assure independence between training and test data. The D418 and TEST-RL datasets include chains varying between 46 and 200 residues in length, while the FEAT and TEST datasets include chains of unrestricted length (minimum 42 and maximum 1169 residues). This experimental design is consistent with that in . We also introduce a new test dataset of 2000 proteins (hereafter TEST-NEW), which is used to assess the quality of predictions for recently considered targets; we note that the FEAT, TEST and TEST-RL datasets are based on proteins deposited before April 2007. The TEST-NEW dataset simulates a large-scale application of the proposed method, and was also developed following the procedure in .
The crystallizable proteins were extracted from sequences deposited in TargetDB. We selected the last 1000 depositions as of December 31, 2008 that are annotated as having "Diffraction-quality Crystals", and are not annotated with "In PDB" in the "Status" field. The resulting set includes proteins deposited between July 2006 and December 2008. The non-crystallizable sequences, which correspond to the actual construct sequences used, were extracted from the trial sequences stored in PepcDB. Sequences that are annotated as "work stopped" in the "Status" field and "Cloned" but not including an indicator of crystallization (e.g. "Crystals") in the "Status History" field were included in the set. Among these targets we removed DNA sequences, sequences which were annotated as "test target" and sequences for which "stopDetails" included "duplicate target found". As in the case of crystallizable chains, the remaining chains were filtered to select the last 1000 depositions as of December 31, 2008. The selected 2000 sequences were also processed to remove the N-terminal hexaHis tag (MGHHHHHHSH) and LEHHHHHH tag at the C-terminus, which are introduced to ease the purification; the same was done in . Finally, we removed duplicate sequences and, as a result, the selected 2000 protein chains are nonredundant. Our results on this dataset are compared with the predictions of the ParCrys and XtalPred methods.
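The tag removal and duplicate filtering described above can be sketched as follows. This is a minimal illustration, assuming that the tags are matched exactly at the chain ends and that duplicates are identified by exact string identity after tag removal; the function names `strip_tags` and `clean_dataset` are hypothetical and are not taken from the original pipeline.

```python
# Tag sequences as given in the text.
N_TAG = "MGHHHHHHSH"  # N-terminal hexaHis tag
C_TAG = "LEHHHHHH"    # C-terminal tag


def strip_tags(seq):
    """Remove the N- and C-terminal purification tags if present."""
    seq = seq.strip().upper()
    if seq.startswith(N_TAG):
        seq = seq[len(N_TAG):]
    if seq.endswith(C_TAG):
        seq = seq[:-len(C_TAG)]
    return seq


def clean_dataset(seqs):
    """Strip tags, then drop exact duplicates while keeping input order."""
    seen = set()
    cleaned = []
    for raw in seqs:
        core = strip_tags(raw)
        if core and core not in seen:
            seen.add(core)
            cleaned.append(core)
    return cleaned


if __name__ == "__main__":
    example = ["MGHHHHHHSHACDEFLEHHHHHH", "ACDEF", "KLMNP"]
    print(clean_dataset(example))
```

Stripping the tags before checking for duplicates matters in this sketch: two constructs of the same chain that differ only by a tag collapse into a single entry.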
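The composition vector contains the frequency of each of the 20 amino acid types, n_i / k, where n_i is the number of occurrences of amino acid AAi in the sequence and k is the length of the sequence.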
The amino-acid collocation vector was first used in  and it is defined as the number of occurrences of two or more amino acids that are separated by gaps, i.e., amino acids of any type. CRYSTALP employed a collocation vector for two amino acids (collocated dipeptides) that are separated by up to four gaps, i.e., AAiAAj, AAi-AAj, AAi--AAj, AAi---AAj, and AAi----AAj, where AAiAAj is a dipeptide, AAi-AAj is the same dipeptide separated by one amino acid of any type (denoted by -), etc. This yields 5*400 = 2000 collocation features. For CRYSTALP2 we also consider collocated tripeptides, which include 8000 tripeptides AAiAAjAAk, and 24000 gapped tripeptides, AAiAAj-AAk, AAi-AAjAAk, and AAi-AAj-AAk. In contrast to CRYSTALP, the numbers of occurrences of all collocated di- and tripeptides are normalized by the sequence length to allow predictions for sequences of unrestricted size. We note that local neighborhood information in the protein chain was also utilized in a recent method for design of crystallizable protein variants.
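To make the collocation features concrete, the sketch below counts one gapped pattern in a sequence and normalizes the count by the sequence length. It is an illustration only: the pattern notation (a `-` standing for any residue) follows the text, while the function names and the choice to count overlapping occurrences are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def count_pattern(seq, pattern):
    """Count windows of seq that match pattern; '-' matches any residue.

    Overlapping matches are counted (an assumption of this sketch).
    """
    width = len(pattern)
    hits = 0
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if all(p == "-" or p == r for p, r in zip(pattern, window)):
            hits += 1
    return hits


def collocation_feature(seq, pattern):
    """Occurrence count normalized by the sequence length."""
    if not seq:
        raise ValueError("empty sequence")
    return count_pattern(seq, pattern) / len(seq)


def all_patterns():
    """Enumerate every collocated di- and tripeptide pattern from the text."""
    patterns = []
    for gap in range(5):  # dipeptides separated by 0 to 4 residues
        for a, b in product(AMINO_ACIDS, repeat=2):
            patterns.append(a + "-" * gap + b)
    for a, b, c in product(AMINO_ACIDS, repeat=3):
        patterns.extend([a + b + c, a + b + "-" + c,
                         a + "-" + b + c, a + "-" + b + "-" + c])
    return patterns


if __name__ == "__main__":
    print(collocation_feature("LAELSE", "L-E"))
    print(len(all_patterns()))
```

Enumerating the patterns gives 2000 dipeptide and 32000 tripeptide collocations, in line with the counts stated above.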
We also used pI and hydrophobicity as features. pI was used in OB-Score, ParCrys and XtalPred, and is strongly related to the efficiency of crystallization screening [16, 17]. The pI values were computed using the ExPASy server based on pK values of amino acids described in . Sequence-based hydrophobicity was also used in [21, 23]. As in , the hydrophobicity was calculated as the sum of Goldman-Engelman-Steitz (GES) hydrophobicity values for all residues, divided by the sequence length. The total number of features computed is 34,022.
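The average hydrophobicity feature and the overall feature count can be illustrated with a short sketch. The scale passed to `average_hydrophobicity` is a placeholder: the two values in `TOY_SCALE` are hypothetical and are not GES values. The tally assumes 20 composition features (one per standard amino acid), which is consistent with the stated total.

```python
def average_hydrophobicity(seq, scale):
    """Sum of per-residue hydrophobicity values divided by the sequence length."""
    if not seq:
        raise ValueError("empty sequence")
    return sum(scale[res] for res in seq) / len(seq)


# Hypothetical two-residue scale used only to show the call; NOT GES values.
TOY_SCALE = {"A": 1.0, "L": -1.0}

# Tally of the candidate features described in the text.
N_COMPOSITION = 20                      # assumption: one value per amino acid type
N_DIPEPTIDES = 5 * 20 ** 2              # gaps of 0 to 4 residues, 400 pairs each
N_TRIPEPTIDES = 20 ** 3 + 3 * 20 ** 3   # contiguous plus three gapped layouts
N_PHYSICOCHEMICAL = 2                   # pI and average hydrophobicity
TOTAL_FEATURES = (N_COMPOSITION + N_DIPEPTIDES
                  + N_TRIPEPTIDES + N_PHYSICOCHEMICAL)

if __name__ == "__main__":
    print(average_hydrophobicity("AAL", TOY_SCALE))
    print(TOTAL_FEATURES)
```

Under these assumptions the tally is 20 + 2000 + 32000 + 2 = 34,022, matching the number of features reported above.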
Selected set of features.
Collocated dipeptides: DL, ES, GL, HH, IR, LF, LS, PP, QG, QM, RI, SS, SV, WC, WM, WV, WW, YI, YT, C-A, D-L, H-G, H-H, H-R, I-R, L-E, Q-L, R-S, T-K, T-S, T-T, D--M, F--S, H--C, H--H, K--W, L--N, S--L, T--G, W--W, Y--N, E---Q, E---S, F---T, G---H, L---D, L---L, Q---C, R---D, V---Y, Y---I, C----E, C----H, C----S, E----F, E----Q, G----R, I----E, L----L, M----V, M----Y, S----H, V----T, W----H, W----M
Collocated tripeptides: EFV, IVV, TKV, F-TK, K-TV, M-DS, P-PE, Q-QQ, R-PS, DP-V, LR-F, MG-S, SA-D, VT-G, YV-E, F-E-F, K-I-R, N-P-G, S-T-S
Physicochemical properties: pI, average hydrophobicity
The SECRET and ParCrys methods employ kernel-based classifiers as their prediction models. SECRET uses Support Vector Machines with Gaussian kernels, while ParCrys employs the Parzen window density estimator. We use another kernel-based technique, the normalized Gaussian radial basis function (RBF) network, which is a neural network with a hidden layer based on the non-linear Gaussian kernel function. In contrast to classical RBF networks , the normalized RBF (NRBF) networks have been shown to improve generalization, which leads to better performance on unseen test data . We utilized the NRBF implementation in WEKA , in which the RBF functions are computed using the k-means clustering algorithm, i.e., symmetric multivariate Gaussians are fitted to the data for each k-means generated cluster, and the classification is based on logistic regression. This classifier requires the number of clusters, the width of the Gaussian kernel, and the ridge value for the logistic regression to be specified as training parameters. The number of clusters equals 2, which is the number of classes (prediction outcomes) in our problem. The other two parameters were selected based on a grid search using tenfold cross-validation tests on the D418 dataset. The best classification accuracy was obtained for a ridge value of 140 and kernel width 2.0. We note that each prediction generated by CRYSTALP2 is associated with a confidence score, defined as the difference between the probabilities of the two outcomes. The NRBF network generates a probability that a given input chain is predicted as crystallizable and as non-crystallizable. CRYSTALP2 predicts that a diffraction-quality crystal can be obtained when the confidence for this class is greater than that for the non-crystallizable class.
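The prediction step of a normalized Gaussian RBF network can be sketched in a few lines. This is not the WEKA implementation: the centers and output weights below are hypothetical constants, whereas in the method described above the centers come from k-means clustering and the output layer from ridge-regularized logistic regression. The sketch also assumes that the confidence score is the absolute difference between the two class probabilities.

```python
import math


def normalized_rbf(x, centers, width):
    """Gaussian responses to each center, divided by their sum."""
    raw = []
    for c in centers:
        sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        raw.append(math.exp(-sq_dist / (2.0 * width ** 2)))
    total = sum(raw)
    if total == 0.0:  # input far from every center: fall back to uniform
        return [1.0 / len(centers)] * len(centers)
    return [r / total for r in raw]


def predict(x, centers, width, weights, bias):
    """Logistic output on the normalized activations; returns (label, confidence)."""
    phi = normalized_rbf(x, centers, width)
    z = bias + sum(w * p for w, p in zip(weights, phi))
    p_cryst = 1.0 / (1.0 + math.exp(-z))
    p_non = 1.0 - p_cryst
    label = "crystallizable" if p_cryst > p_non else "noncrystallizable"
    return label, abs(p_cryst - p_non)


# Hypothetical model: two clusters (one per class) and a kernel width of 2.0.
CENTERS = [[0.0, 0.0], [4.0, 4.0]]
WEIGHTS = [2.0, -2.0]
BIAS = 0.0

if __name__ == "__main__":
    print(predict([0.5, 0.2], CENTERS, 2.0, WEIGHTS, BIAS))
```

Because the activations are normalized to sum to one, an input far from both centers still receives a well-defined prediction, which is one intuition for the improved generalization attributed to NRBF networks.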
Results and discussion
Comparison with competing methods
Comparison of prediction quality measured via accuracy, MCC and AROC between the proposed and five competing methods.
Table 2 shows that CRYSTALP2 provides an improvement over CRYSTALP. While both methods show the same quality on the D418 dataset, CRYSTALP performs relatively poorly on the TEST-RL dataset. This is likely due to the input features not being normalized in this method; the TEST-RL set has a different distribution of protein chain sizes than the D418 set. We observe that CRYSTALP2 obtains MCC = 0.4 on this test set, which is similar to the result of OB-Score and worse only than the results of ParCrys and XtalPred. At the same time, the proposed method outperforms all competing methods except XtalPred on the TEST set, which is larger than the TEST-RL dataset and contains chains of unrestricted size. The tests on the largest TEST-NEW dataset indicate that the three top performing methods, ParCrys, XtalPred and CRYSTALP2, provide similar performance with accuracy of about 70%, and MCC and AROC around 0.4 and 0.75, respectively.
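For readers who wish to reproduce this kind of comparison, accuracy and the Matthews correlation coefficient (MCC) can be computed from the four confusion-matrix counts as sketched below. The counts in the usage line are hypothetical and are not results from this study; crystallizable chains are treated as the positive class.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of chains that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when a marginal total is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts for a balanced set of 144 chains.
    tp, tn, fp, fn = 55, 50, 22, 17
    print(accuracy(tp, tn, fp, fn), mcc(tp, tn, fp, fn))
```

MCC ranges from -1 to 1 and equals 0 for random predictions, which makes it easier to compare across datasets than accuracy alone.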
The ROC curves in Figure 1 were generated for the three best performing methods (CRYSTALP2, ParCrys, and XtalPred) on the TEST, TEST-RL and TEST-NEW datasets to facilitate a more detailed comparison. We observe that for the TEST dataset CRYSTALP2 outperforms ParCrys for low and mid-range values of FP rate (when a relatively low number of chains is incorrectly classified as crystallizable), while ParCrys generates slightly higher TP rates for FP rate > 0.6. CRYSTALP2 would thus be more appropriate than ParCrys when the cost of incorrectly classifying a chain as crystallizable is significant. XtalPred is shown to generally outperform both ParCrys and CRYSTALP2 on this dataset. In the case of the TEST-RL dataset ParCrys and XtalPred are shown to provide favorable prediction quality when compared with CRYSTALP2. Finally, the ROC curves on the largest TEST-NEW dataset show that the three methods are characterized by similar performance across the entire range of the FP and TP rates. Overall, although XtalPred seems to provide good performance on all three datasets, we observe that there is no clear cut winner and that all three methods provide relatively comparable prediction quality.
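The ROC points and the area under the ROC curve (AROC) used in such comparisons can be derived from per-chain scores, as in the sketch below. It assumes that a higher score means "more likely crystallizable" and uses the rank-based identity in which AROC equals the probability that a randomly chosen crystallizable chain scores higher than a randomly chosen noncrystallizable one (ties count one half). The scores in the usage lines are hypothetical.

```python
def roc_points(scores_pos, scores_neg):
    """Return (FP rate, TP rate) pairs for every distinct score threshold."""
    thresholds = sorted(set(scores_pos) | set(scores_neg), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in scores_pos) / len(scores_pos)
        fpr = sum(s >= t for s in scores_neg) / len(scores_neg)
        points.append((fpr, tpr))
    return points


def aroc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


if __name__ == "__main__":
    crystallizable = [0.9, 0.8, 0.4]      # hypothetical scores
    noncrystallizable = [0.7, 0.3, 0.2]   # hypothetical scores
    print(roc_points(crystallizable, noncrystallizable))
    print(aroc(crystallizable, noncrystallizable))
```

The pairwise form is quadratic in the number of chains, which is acceptable for datasets of the size considered here; a sort-based rank computation would be preferable for much larger sets.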
Comparison of predictions generated by CRYSTALP2, XtalPred and ParCrys on the TEST, TEST-RL and TEST-NEW datasets.
Discussion of the proposed sequence representation
The 88 features selected for CRYSTALP2 include elements of the composition and collocation vector, which are computed directly from the sequence, and pI and hydrophobicity, which are derived from the sequence by considering specific physicochemical properties of the amino acid chains. We note that the two latter features were used in several past studies [16, 17, 21, 23], while the former set of 86 features is introduced in this work as an extension of work done in . We investigate whether these two sources of data, i.e., sequence and physicochemical properties of the sequence, provide complementary or redundant information in the context of predicting crystallization propensity.
Comparison of prediction quality measured via accuracy, MCC and AROC between the proposed method that uses the set of 88 features (including composition, collocation, pI and hydrophobicity), a method that uses the 86 composition and collocation features, and a method that uses only pI and hydrophobicity features.
Method (# features): only pI and hydrophobicity (2 features); only composition and collocation (86 features); CRYSTALP2 (88 features)
In the following we investigate individual features used by CRYSTALP2. We show that the features based on the collocation of residues involve amino acid types that are also utilized in crystallization-enhancing mutagenesis. We then discuss the association of the individual features with the prediction outcomes.
The surface entropy reduction approach, i.e. point-mutation-based replacement of solvent-exposed residues having high conformational entropy (e.g. Glu (E), Gln (Q), and Lys (K)), with residues having lower conformational entropy and higher potential to mediate crystal contacts (such as Ala (A), Tyr (Y), Thr (T), Ser (S), and His (H)) provides a viable strategy to minimize the loss of conformational entropy upon crystallization and renders crystallization thermodynamically favorable [46, 47, 37]. The sites for mutagenesis are usually chosen considering their proximity in the sequence [37, 47, 48], which conceptually resembles our collocation vector approach. At the same time, the ParCrys and XtalPred methods use the composition of several AA types without considering their proximity. The eight AA types involved in surface entropy reduction are likely to be indicative of proteins with low/high crystallization propensity, and they occur in 73% of the features used by CRYSTALP2. Since the combined abundance of these AAs in protein chains is about 41%, their higher occurrence rate in our feature set demonstrates that CRYSTALP2 implicitly applies information about conformational entropy. We note that ParCrys uses the composition of Ser (S), Gly (G), Cys (C), Phe (F), Tyr (Y), and Met (M) AAs. Only two of these AA types are associated with the residues that are suggested in crystallization enhancing mutagenesis, which further supports our claim of complementarity between CRYSTALP2 and ParCrys. Similarly, XtalPred analyzes the composition of Cys (C), Met (M), Trp (W), Tyr (Y), and Phe (F) AAs, and again among these amino acid types only Y appears in the context of the mutagenesis.
Since CRYSTALP2 uses a nonlinear, black-box model to represent the relation between all input features taken together and the prediction outcomes, it is not possible to directly use this model to determine the associations of individual features with a specific outcome. Instead, we computed the biserial correlation coefficients between individual features and the annotation of the corresponding protein chains (crystallizable vs. noncrystallizable) to quantify the strength of the associations. Overall, we observe that 75 features used by CRYSTALP2 are characterized by weak absolute correlation coefficient values (<0.1). While individually these features provide little useful information, the classification model exploits these individually weak correlations by combining information from multiple features. The remaining 13 features having higher coefficient values include (the correlation coefficients are shown in brackets) L-E (0.28), SS (0.25), L (0.20), T-S (0.16), GL (0.15), R-S (0.14), I----E (0.14), L---L (0.14), F--S (0.12), E----F (0.11), S----H (0.11), S-T-S (0.11) and pI (0.3). We observe that the above collocations include AA types which are complementary to the AA types utilized by XtalPred (C, F, M, W, and Y; only one AA type, F, is in common). The same is true when we consider ParCrys, which uses the composition of C, F, G, M, S, and Y (only F, G, and S are in common).
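A correlation between a real-valued feature and a binary annotation can be computed as sketched below. The sketch uses the point-biserial coefficient, which is the Pearson correlation with the class coded as 0/1; whether this matches the exact biserial variant used in the analysis above is an assumption. The feature values in the usage line are hypothetical.

```python
import math


def point_biserial(values, labels):
    """Pearson correlation between feature values and 0/1 class labels."""
    n = len(values)
    mean_x = sum(values) / n
    mean_y = sum(labels) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(values, labels))
    var_x = sum((x - mean_x) ** 2 for x in values)
    var_y = sum((y - mean_y) ** 2 for y in labels)
    if var_x == 0 or var_y == 0:  # constant feature or a single class
        return 0.0
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical normalized counts of one collocation in four chains;
    # label 1 = crystallizable, 0 = noncrystallizable.
    feature = [0.1, 0.2, 0.3, 0.4]
    labels = [0, 0, 1, 1]
    print(point_biserial(feature, labels))
```

Taking the absolute value of the coefficient gives the strength of association regardless of which class the feature favors, which is how the 0.1 cut-off above is applied.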
Analysis of CRYSTALP2 predictions
We additionally examine the results obtained by CRYSTALP2 in our second test (training on FEAT, testing on the TEST-NEW, TEST and TEST-RL datasets). Two questions are of interest: 1) could the prediction quality improve if the size of the FEAT dataset were increased (as more crystallization reports become available)? and 2) how does the proposed method perform for each of the prediction outcomes?
Impact of the size of the training dataset
In TEST-RL, the prediction quality varies more substantially than for the TEST and TEST-NEW datasets. In spite of this, we can discern a general upward trend in prediction quality for the three datasets. The trends for the TEST and TEST-NEW datasets are clearer and we observe that the prediction quality improves as more of the FEAT dataset is included in training, and reaches its maximum when the entire FEAT dataset is used. Most importantly, we observe that the rate of improvement is relatively constant, even when considering large fractions of the training dataset, i.e., 80, 90, and 100%. Extrapolation of this trend suggests that inclusion of additional data in the training dataset could result in a further increase of the prediction quality.
The linear regressions in Figure 4 show that the improvements have larger magnitude for the TEST and TEST-NEW datasets than for the TEST-RL dataset, which highlights the difference between these datasets. We note that FEAT, TEST-NEW and TEST include sequences of unrestricted size, while TEST-RL only includes sequences between 46 and 200 residues in length. This difference in the distribution of the sequence sizes is a likely cause of the stronger improvements in the case of the TEST and TEST-NEW datasets.
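A trend of this kind can be quantified with an ordinary least-squares line through the (training fraction, prediction quality) points. The sketch below is illustrative only: the accuracy values are hypothetical and are not read from Figure 4, and the function name `linear_fit` is an assumption.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Hypothetical accuracies (%) at increasing fractions (%) of the training set.
    fractions = [20, 40, 60, 80, 100]
    accuracies = [64.0, 66.5, 67.0, 69.5, 71.0]
    slope, intercept = linear_fit(fractions, accuracies)
    print(slope, intercept)
```

A larger positive slope for one test set than for another corresponds to the stronger improvement discussed above; comparing slopes is meaningful only when the same training fractions are used for each dataset.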
Results for prediction of crystallizable and noncrystallizable proteins
Comparison of prediction quality measured with sensitivity and specificity for the prediction of the crystallizable and noncrystallizable proteins by the CRYSTALP2 method.
We introduce a novel algorithm, CRYSTALP2, that predicts the propensity of a given protein chain to generate diffraction-quality crystals via current structural biology techniques. Our results indicate that hydrophobicity, isoelectric point, and the frequency of certain collocated di- and tripeptides are important predictors of crystallization. We show that the collocation features provide a complementary source of information when compared with the hydrophobicity and isoelectric point. CRYSTALP2 associates AA collocations corresponding to clusters of residues having low conformational entropy and high potential to mediate crystal contacts with crystallizable proteins. Clusters of residues having high conformational entropy are associated with the non-crystallizable proteins. Such patterns could serve as potential crystallization markers.
Tests on several independent datasets show that CRYSTALP2 outperforms several existing methods such as SECRET, CRYSTALP and OB-Score, and provides comparable and complementary results to the ParCrys and XtalPred methods. The complementarity between CRYSTALP2 and XtalPred suggests that the proposed black-box method is a useful adjunct to the current manual techniques of structural biologists, which are modelled in XtalPred. Our results suggest that an increase in the size of the training set, which would be caused by the continuing protein crystallization efforts, may result in an increase of the prediction quality of CRYSTALP2. We also show that the proposed method performs better in predicting crystallizable proteins than in predicting noncrystallizable proteins.
We note that our method and all competing approaches, i.e., SECRET, CRYSTALP, OB-Score, XtalPred and ParCrys, take into account only intra-molecular factors that are encoded in the protein chain. They may not provide reliable predictions when inter-molecular factors such as protein-protein and/or protein-precipitant interactions, buffer composition, precipitant diffusion method, gravity, etc. must be considered. All of these sequence-based predictors are limited to predicting crystallization propensity for non-redundant chains; they should not be used when assessing crystallization of homologues. In the latter case we recommend the use of the web server at http://www.doe-mbi.ucla.edu/Services/SER. Finally, our predictions concern only soluble proteins, as only such proteins were used to train and validate the prediction methods. In spite of these limitations, methods such as the proposed CRYSTALP2 should find useful applications. For instance, a potential application area is the Structural Genomics initiative where structures are sought for a protein that represents a given protein family rather than for a particular protein chain [8–11].
This research was supported in part by NSERC under the Discovery Grants program. The authors would like to thank Lukasz Slabinski for help with running the XtalPred server and Ian Overton for providing the TEST, TEST-RL and FEAT datasets.
- Chandonia JM, Brenner SE: The impact of structural genomics: expectations and outcomes. Science 2006, 311: 347–351. 10.1126/science.1121018
- Norin M, Sundström M: Protein models in drug discovery. Curr Opin Drug Discov Devel 2001, 4: 284–290.
- Fernàndez-Busquets X, de Groot NS, Fernandez D, Ventura S: Recent structural and computational insights into conformational diseases. Curr Med Chem 2008, 15: 1336–49. 10.2174/092986708784534938
- Lacapère JJ, Pebay-Peyroula E, Neumann JM, Etchebest C: Determining membrane protein structures: still a challenge! Trends Biochem Sci 2007, 32(6):259–70. 10.1016/j.tibs.2007.04.001
- Schnell JR, Chou JJ: Structure and mechanism of the M2 proton channel of influenza A virus. Nature 2008, 451: 591–595. 10.1038/nature06531
- Xu C, Gagnon E, Call ME, Schnell JR, Schwieters CD, Carman CV, Chou JJ, Wucherpfennig KW: Regulation of T cell receptor activation by dynamic membrane binding of the CD3epsilon cytoplasmic tyrosine-based motif. Cell 2008, 135(4):702–713. 10.1016/j.cell.2008.09.044
- Service R: Structural genomics, round 2. Science 2005, 307: 1554–1558. 10.1126/science.307.5715.1554
- Brenner SE: Target selection for structural genomics. Nat Struct Biol 2000, 7: 967–969. 10.1038/80747
- Chandonia JM, Brenner SE: Implications of structural genomics target selection strategies: Pfam whole genome, and random approaches. Proteins 2005, 58: 166–179. 10.1002/prot.20298
- Hui R, Edwards A: High-throughput protein crystallization. J Struct Biol 2003, 142: 154–61. 10.1016/S1047-8477(03)00046-7
- Savchenko A, Yee A, Khachatryan A, Skarina T, Evdokimova E, Pavlova M, Semesi A, Northey J, Beasley S, Lan N, Das R, Gerstein M, Arrowsmith CH, Edwards AM: Strategies for structural proteomics of prokaryotes: quantifying the advantages of studying orthologous proteins and of using both NMR and x-ray crystallography approaches. Proteins 2003, 50: 392–399. 10.1002/prot.10282
- Biertumpfel C, Basquin J, Suck D: Practical implementations for improving the throughput in a manual crystallization setup. J Appl Cryst 2005, 38: 568–570. 10.1107/S0021889805008277
- Chayen NE: Turning protein crystallisation from an art into a science. Curr Opin Struct Biol 2004, 14: 577–583. 10.1016/j.sbi.2004.08.002
- Pusey M, Liu ZJ, Tempel W, Praissman J, Lin D, Wang BC, Gavira JA, Ng JD: Life in the fast lane for protein crystallization and X-ray crystallography. Prog Biophys Mol Biol 2005, 88: 359–386. 10.1016/j.pbiomolbio.2004.07.011
- Rupp B, Wang JW: Predictive models for protein crystallization. Methods 2004, 34: 391–408. 10.1016/j.ymeth.2004.03.031
- Kantardjieff KA, Rupp B: Protein isoelectric point as a predictor for increased crystallization screening efficiency. Bioinformatics 2004, 20: 2162–2168. 10.1093/bioinformatics/bth066
- Kantardjieff KA, Jamshidian M, Rupp B: Distributions of pI vs pH provide strong prior information for the design of crystallization screening experiments. Bioinformatics 2004, 20: 2171–2174. 10.1093/bioinformatics/bth453
- Canaves JM, Page R, Wilson IA, Stevens RC: Protein biophysical properties that correlate with crystallisation success in Thermotoga maritima: maximum clustering strategy for structural genomics. J Mol Biol 2004, 344: 977–991. 10.1016/j.jmb.2004.09.076
- Goh CS, Lan N, Douglas SM, Wu B, Echols N, Smith A, Milburn D, Montelione GT, Zhao H, Gerstein M: Mining the structural genomics pipeline: identification of protein properties that affect high-throughput experimental analyses. J Mol Biol 2004, 336: 115–130. 10.1016/j.jmb.2003.11.053
- Smialowski P, Schmidt T, Cox J, Kirschner A, Frishman D: Will my protein crystallize? A sequence-based predictor. Proteins 2006, 62: 343–355. 10.1002/prot.20789
- Overton IM, Barton GJ: A normalised scale for structural genomics target ranking: the OB-Score. FEBS Lett 2006, 580: 4005–4009. 10.1016/j.febslet.2006.06.015
- Chen K, Kurgan L, Rahbari M: Prediction of protein crystallization using collocation of amino acid pairs. Biochem Biophys Res Commun 2007, 355: 764–769. 10.1016/j.bbrc.2007.02.040
- Overton IM, Padovani G, Girolami MA, Barton GJ: ParCrys: a Parzen window density estimation approach to protein crystallization propensity prediction. Bioinformatics 2008, 24: 901–907. 10.1093/bioinformatics/btn055
- Slabinski L, Jaroszewski L, Rodrigues APC, Rychlewski L, Wilson IA, Lesley SA, Godzik A: The challenge of protein structure determination – lessons from structural genomics. Protein Science 2007, 16(11):2472–82. 10.1110/ps.073037907
- Chen L, Oughtred R, Berman HM, Westbrook J: TargetDB: a target registration database for structural genomics projects. Bioinformatics 2004, 20(16):2860–2. 10.1093/bioinformatics/bth300
- Campbell K, Kurgan L: Sequence-only based prediction of β-turn location and type using collocation of amino acid pairs. Open Bioinf J 2008, 2: 37–49. 10.2174/1875036200802010037
- Chen K, Kurgan L, Ruan J: Prediction of flexible/rigid regions in proteins from sequences using collocated amino acid pairs. BMC Struct Biol 2007, 7: 25. 10.1186/1472-6807-7-25
- Chen K, Jiang Y, Du L, Kurgan L: Prediction of integral membrane protein type by collocated hydrophobic amino acid pairs. J Comput Chem 2009, 30(1):163–172. 10.1002/jcc.21053
- Chen YZ, Tang YR, Sheng ZY, Zhang Z: Prediction of mucin-type O-glycosylation sites in mammalian proteins using the composition of k-spaced amino acid pairs. BMC Bioinformatics 2008, 9: 101. 10.1186/1471-2105-9-101
- Chou KC, Shen HB: Recent progresses in protein subcellular location prediction. Anal Biochem 2007, 370: 1–16. 10.1016/j.ab.2007.07.006
- Chou KC: Progress in protein structural class prediction and its impact to bioinformatics and proteomics. Cur Prot Pept Science 2005, 6: 423–436. 10.2174/138920305774329368
- Chou KC: Structural bioinformatics and its impact to biomedical science. Cur Med Chem 2004, 11: 2105–2134.
- Kurgan LA, Cios KJ, Zhang H, Zhang T, Chen K, Shen S, Ruan J: Sequence-based methods for real value predictions of protein structure. Cur Bioinformatics 2008, 3(3):183–196. 10.2174/157489308785909197
- Yang ZR, Wang L, Young N, Chou KC: Pattern recognition methods for protein functional site prediction. Cur Prot Pept Science 2005, 6: 479–491. 10.2174/138920305774329322
- Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22: 1658–1659. 10.1093/bioinformatics/btl158
- Barton GJ, Sternberg MJE: A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons. J Mol Biol 1987, 198: 327–337. 10.1016/0022-2836(87)90316-0
- Goldschmidt L, Cooper DR, Derewenda Z, Eisenberg D: Toward rational protein crystallization: A Web server for the design of crystallizable protein variants. Protein Sci 2007, 16: 1569–76. 10.1110/ps.072914007
- Gasteiger E, Hoogland C, Gattiker A, Duvaud S, Wilkins MR, Appel RD, Bairoch A: Protein identification and analysis tools on the ExPASy server. In The Proteomics Protocols Handbook. Edited by: Walker JM. Humana Press; 2005:571–607.
- Bjellqvist B, Basse B, Olsen E, Celis JE: Reference points for comparisons of two-dimensional maps of proteins from different human cell types defined in a pH scale where isoelectric points correlate with polypeptide compositions. Electrophoresis 1994, 15: 529–539. 10.1002/elps.1150150171
- Engelman DM, Steitz TA, Goldman A: Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Ann Rev Biophys Biophys Chem 1986, 15: 321–353. 10.1146/annurev.bb.15.060186.001541
- Hall M: Correlation based feature selection for machine learning, Ph.D. dissertation. University of Waikato, Dept of Computer Science; 1999.
- Moody J, Darken Ch: Fast learning in networks of locally-tuned processing units. Neural Computation 1989, 1: 281–294. 10.1162/neco.1989.1.2.281
- Bugmann G: Normalized Gaussian Radial Basis Function networks. Neurocomputing 1998, 20: 97–110. 10.1016/S0925-2312(98)00027-7
- Witten I, Frank E: Data Mining: Practical Machine Learning Tools and Techniques. second edition. Morgan Kaufmann, San Francisco; 2005.
- Slabinski L, Jaroszewski L, Rychlewski L, Wilson IA, Lesley SA, Godzik A: XtalPred: a web server for prediction of protein crystallizability. Bioinformatics 2007, 23(24):3403–5. 10.1093/bioinformatics/btm477
- Cooper DR, Boczek T, Grelewska K, Pinkowska M, Sikorska M, Zawadzki M, Derewenda Z: Protein crystallization by surface entropy reduction: optimization of the SER strategy. Acta Crystallogr D Biol Crystallogr 2007, 63: 636–45. 10.1107/S0907444907010931
- Derewenda Z: Rational protein crystallization by mutational surface engineering. Structure 2004, 12: 529–35. 10.1016/j.str.2004.03.008
- Wang W, Malcolm BA: Two-stage PCR protocol allowing introduction of multiple mutations, deletions and insertions using QuikChange Site-Directed Mutagenesis. Biotechniques 1999, 26: 680–2.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.