In the present study, we used the SCOP database (version 1.73; released in December 2007) to assess the performance of the different descriptors, train the SVM models of TIM-Finder, and construct the library of TIM-Finder. Several SCOP sequence datasets with different levels of sequence redundancy were obtained from http://scop.mrc-lmb.cam.ac.uk/scop/ [4, 37]. The downloaded SCOP_10 dataset contains 163 TIM-barrel proteins and 5,451 non-TIM-barrel proteins, and the sequence identity for any sequence pair in this dataset is ≤ 10%. Because all TIM-barrel proteins have a sequence length of more than 100 amino acids, the non-TIM-barrel proteins with fewer than 100 amino acids were removed. Moreover, for each non-TIM-barrel fold, only one protein was randomly selected as the final negative control. Thus, the SCOP_10 dataset was compiled into a modified dataset of 163 TIM-barrel proteins and 843 non-TIM-barrel proteins (i.e., SCOP_10_mod), which was employed to assess the performance of the different descriptors and to train the SVM models. The SCOP_40 dataset, containing 9,536 proteins, was downloaded for the construction of the library of TIM-Finder. The downloaded SCOP_95 dataset, containing 15,273 proteins, was used to derive the motif-based descriptor.
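The compilation of SCOP_10_mod described above can be illustrated with a minimal sketch. The input layout (tuples of identifier, SCOP fold identifier, and sequence), the function name, and the random seed are assumptions made for illustration; the fold identifier c.1 is the SCOP label for the TIM β/α-barrel fold.

```python
import random
from collections import defaultdict

TIM_FOLD = "c.1"  # SCOP fold identifier of the TIM beta/alpha-barrel


def build_modified_dataset(proteins, min_length=100, seed=0):
    """Split proteins into TIM-barrel positives and one negative per fold.

    proteins: iterable of (protein_id, fold_id, sequence) tuples
              (a hypothetical layout, not the SCOP file format).
    """
    rng = random.Random(seed)
    positives = []
    negatives_by_fold = defaultdict(list)
    for protein_id, fold_id, sequence in proteins:
        if fold_id == TIM_FOLD:
            positives.append(protein_id)
        elif len(sequence) >= min_length:
            # Non-TIM-barrel proteins shorter than min_length are dropped.
            negatives_by_fold[fold_id].append(protein_id)
    # Keep one randomly chosen representative for each non-TIM-barrel fold.
    negatives = [rng.choice(members)
                 for _, members in sorted(negatives_by_fold.items())]
    return positives, negatives
```

Fixing the seed makes the random choice of negative controls reproducible; the source does not state how the random selection was seeded.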
The NCBI non-redundant (NR) sequence database was downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/ (November 2008). The NR database was further clustered at 90% identity by using the CD-hit program, and the resulting NR90 database, containing 4,205,215 sequences, was used to implement the PSI-BLAST search. To derive the motif-based descriptor, the PROSITE release 20.27, which contains 1,318 patterns and 778 profiles, was obtained from http://www.expasy.org/prosite/.
A PSI-BLAST search for sequence A against sequence B was executed in the following two steps. First, sequence A was searched against the NR90 database by PSI-BLAST for three rounds to generate a profile. The e-value cutoff for including sequences in the profile was set at 0.001. Second, a PSI-BLAST search was performed on the obtained profile against sequence B for another round. The above PSI-BLAST search resulted in two parameters, the expected value evalue(A, B) and the bit score Score(A, B), which can be used to measure the sequence similarity between A and B. In this work, evalue(A, B) was modified according to the following equation.
Secondary structure element alignment-based descriptor
Briefly, performing an SSEA for two query sequences A and B consisted of the following three procedures. First, the secondary structure prediction for the two query sequences was carried out by PSIPRED. Second, the predicted secondary structural string was converted into a string of secondary structure elements such that "H" represents a helix element, "E" denotes a strand element, and "C" stands for a coil element. For instance, the secondary structure string HHHHHHHCCCCEEEEEEECCCCCCCHHHHHH should be shortened to HCECH, the length of each element being retained for the scoring of SSEA. Third, the two shortened strings (i.e., secondary structure elements) were aligned using a dynamic programming algorithm with a scoring scheme adapted from Przytycka et al. The resulting alignment score SSEA(A, B), ranging from 0 to 1, was used as the descriptor of the similarity between two query sequences. To derive the SSEA-based descriptor, our in-house SSEA algorithm was implemented. More details about this SSEA algorithm are available in our previous study.
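The second step, shortening a predicted secondary structure string into elements while retaining each element's length, can be sketched as follows. This is only an illustration of the conversion; the function name is hypothetical, and the alignment and scoring steps of the in-house SSEA algorithm are not reproduced.

```python
from itertools import groupby


def compress_secondary_structure(ss_string):
    """Collapse runs of identical states into elements.

    Returns the shortened element string and the length of each element,
    e.g. a run of helix residues becomes a single "H" with its run length.
    """
    elements = [(state, sum(1 for _ in run)) for state, run in groupby(ss_string)]
    element_string = "".join(state for state, _ in elements)
    lengths = [length for _, length in elements]
    return element_string, lengths


element_string, lengths = compress_secondary_structure(
    "HHHHHHHCCCCEEEEEEECCCCCCCHHHHHH"
)
print(element_string)  # HCECH
```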
In this work, the PROSITE motif library was used to derive the motif-based descriptor. First, the correlation between the presence of each PROSITE motif and the TIM-barrel fold in the SCOP database (i.e., SCOP_95) can be quantified by a log-odds score S defined as:
S(TIM|motif) = log [ p(TIM, motif) / ( p(TIM) p(motif) ) ]    (2)
where p(motif) and p(TIM) are the individual probabilities of finding a particular sequence motif and a TIM-barrel protein in the SCOP database, and p(TIM, motif) is the corresponding joint probability. We used the Perl script ps_scan (ftp://ftp.expasy.org/databases/prosite/tools/ps_scan/) to determine whether a protein sequence contains a particular PROSITE motif. Furthermore, the motif-based compatibility between a query sequence and the TIM-barrel fold can be expressed as:
S(TIM|sequence) = Σ S(TIM|motif)    (3)
where S(TIM|motif) was calculated from equation 2 and summation was performed over all motifs found in the query sequence and fulfilling the following criteria:
where C is an adjustable parameter, with 0.1 being a preliminary optimized value in this work. For a given protein sequence, a larger value of S(TIM|sequence) means a higher chance that the sequence is a TIM-barrel protein. Therefore, S(TIM|sequence) is used as the motif-based descriptor.
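The motif-based descriptor can be sketched as follows, assuming that the log-odds score takes the form log[p(TIM, motif) / (p(TIM) p(motif))], that the probabilities are estimated from raw counts in SCOP_95, and that the natural logarithm is used. The function names, the count-based estimates, and the `keep` predicate are illustrative assumptions; in particular, the selection criterion involving C is not reproduced here and is left as a user-supplied predicate.

```python
import math


def log_odds(n_total, n_tim, n_motif, n_tim_and_motif):
    """Log-odds association between a motif and the TIM-barrel fold.

    Assumes count-based probability estimates and n_tim_and_motif > 0;
    how zero counts are handled is not specified in the text.
    """
    p_tim = n_tim / n_total
    p_motif = n_motif / n_total
    p_joint = n_tim_and_motif / n_total
    return math.log(p_joint / (p_tim * p_motif))


def sequence_score(motifs_in_query, motif_scores, keep=lambda score: True):
    """Sum S(TIM|motif) over the motifs found in a query sequence.

    keep: hypothetical stand-in for the selection criterion on each motif
          score; by default every scored motif contributes.
    """
    return sum(
        motif_scores[m]
        for m in motifs_in_query
        if m in motif_scores and keep(motif_scores[m])
    )
```

A motif that occurs independently of the TIM-barrel fold scores 0 under this form, a motif enriched in TIM-barrel proteins scores above 0, and a depleted one scores below 0.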
Evaluation of individual descriptors
Based on the SCOP_10_mod dataset, the three descriptors' performance in recognizing TIM-barrel proteins was individually assessed. To assess the performance of the PSI-BLAST-based descriptor, a Leave-One-Out analysis was carried out. Each time, a TIM-barrel protein was selected as a "test" protein. By calculating the similarity scores (i.e., evalue_mod(A, B)), the "test" protein was searched against all other TIM-barrel proteins in the SCOP_10_mod dataset and the protein with the most significant similarity score (i.e., the top hit) was recorded. Likewise, the non-TIM-barrel proteins were also searched against all TIM-barrel proteins. The top hits and the corresponding evalue_mod(A, B) scores were also recorded. By defining a threshold value, the TIM-barrel identification accuracy was measured by Sensitivity and Specificity, defined as Sensitivity = TP/(TP + FN) and Specificity = TN/(TN + FP), where TP, FN, TN, and FP denote the numbers of true positives, false negatives, true negatives, and false positives, respectively.
Moreover, a ROC curve, which plots TPR (i.e., Sensitivity) as a function of FPR (i.e., 1 - Specificity) for all possible thresholds, was also employed to measure the performance. The AUC was also calculated to provide a comprehensive understanding of the performance of the PSI-BLAST-based descriptor. Generally, the closer the AUC value is to 1, the better the descriptor is. The SSEA-based descriptor (i.e., the SSEA(A, B) score) was evaluated based on the same strategy.
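The evaluation measures can be sketched as below, using the standard definitions Sensitivity = TP/(TP + FN) and Specificity = TN/(TN + FP). The sketch assumes that a higher top-hit score indicates a more TIM-barrel-like protein and that a protein is called positive when its score is at or above the threshold; both conventions, and the function names, are assumptions. The AUC is computed through its rank interpretation (the probability that a randomly chosen positive outscores a randomly chosen negative), which equals the area under the ROC curve.

```python
def sensitivity_specificity(pos_scores, neg_scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called TIM-barrel."""
    tp = sum(s >= threshold for s in pos_scores)
    fn = len(pos_scores) - tp
    fp = sum(s >= threshold for s in neg_scores)
    tn = len(neg_scores) - fp
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

The pairwise form is quadratic in the number of proteins, which is acceptable for a dataset of 163 positives and 843 negatives.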
Regarding the motif-based descriptor, the score S(TIM|sequence) for each protein within the SCOP_10_mod dataset was calculated. Because S(TIM|sequence) reflects a given sequence's compatibility with the TIM-barrel fold, it was directly used to judge whether a given protein should have the TIM-barrel fold.
Construction of TIM-Finder
In this work, the three descriptors were combined into a prediction system called TIM-Finder with the assistance of the SVM algorithm. As a machine-learning method for two-class classification, SVM aims to find a rule that best maps each member of a training set to the correct class [41, 42]. Here, the SVM was trained to distinguish two different types of protein pairs related to TIM-barrel proteins. In the first type of protein pair (i.e., positive sample), both proteins are TIM-barrel proteins. The SCOP_10_mod dataset contains 13,203 positive samples [i.e., (163 × 162)/2 = 13,203 pairs; N.B. the pair (A, B) is the same as (B, A) in this case]. In the second type of protein pair (i.e., negative sample), the first protein has the TIM-barrel fold but the second one is a non-TIM-barrel protein. Thus, the SCOP_10_mod dataset contains 137,409 negative samples [i.e., 843 × 163 = 137,409 pairs; N.B. the pair (A, B) is not the same as (B, A) in this case].
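The pair counts quoted above follow directly from enumerating unordered TIM-barrel pairs and ordered (TIM-barrel, non-TIM-barrel) pairs. A minimal sketch, with placeholder identifiers and a hypothetical function name:

```python
from itertools import combinations, product


def make_pairs(tim_ids, non_tim_ids):
    """Enumerate positive and negative protein pairs.

    Positive pairs are unordered pairs of TIM-barrel proteins, so (A, B)
    and (B, A) are counted once. Negative pairs always place the
    TIM-barrel protein first and the non-TIM-barrel protein second.
    """
    positives = list(combinations(tim_ids, 2))
    negatives = list(product(tim_ids, non_tim_ids))
    return positives, negatives


tim_ids = [f"TIM_{i}" for i in range(163)]      # placeholder identifiers
non_tim_ids = [f"NEG_{i}" for i in range(843)]  # placeholder identifiers
positives, negatives = make_pairs(tim_ids, non_tim_ids)
print(len(positives), len(negatives), len(positives) + len(negatives))
# 13203 137409 150612
```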
Due to the direction in which the PSI-BLAST search is carried out, the search for A against B is different from the search for B against A. In our work, the PSI-BLAST search for sequence B against A was also carried out. Thus, four parameters (i.e., evalue_mod(A, B), Score(A, B), evalue_mod(B, A) and Score(B, A)) were generated from the PSI-BLAST-based descriptor. The SSEA descriptor provides one parameter (i.e., SSEA(A, B)). Regarding the motif-based descriptor, two parameters (i.e., S(TIM|sequence A) and S(TIM|sequence B)) were used. Thus, a total of seven parameters were used in the SVM learning.
The SCOP_10_mod dataset can be compiled into 150,612 protein pairs (13,203 positive and 137,409 negative), which were further divided into 5 roughly equal subsets. An evaluation similar to 5-fold cross-validation was performed. To predict whether a given protein pair belongs to the first type or the second type, the subset to which this pair belongs was labeled as the "test" set, whereas the four remaining subsets were labeled as "training" sets. SVM models were developed for each of the "training" sets. The class labels for positive (i.e., the first type) and negative (i.e., the second type) samples were set to +1 and -1, respectively. The ratio of positive to negative samples in the training set was approximately 1:10. Training at such an imbalanced ratio would bias the SVM model toward predicting every pair as a negative case. Each training set was therefore modified by discarding a random selection of the negative samples prior to training, so that the ratio reached the optimized value of 1:2.5. The training resulted in four separate SVM models, with the predicted score being obtained as an average value over the scores from the four different SVM models.
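The subset split and the down-sampling of negative samples can be sketched as follows. The function names, the shuffling before the split, and the fixed seed are assumptions; the text states only that the subsets were roughly equal and that negatives were discarded at random to reach a ratio of 1:2.5.

```python
import random


def split_into_folds(samples, k=5, seed=0):
    """Shuffle the samples and deal them into k roughly equal subsets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def downsample_negatives(positives, negatives, neg_per_pos=2.5, seed=0):
    """Randomly keep about neg_per_pos negatives for every positive sample."""
    rng = random.Random(seed)
    n_keep = min(len(negatives), round(len(positives) * neg_per_pos))
    return rng.sample(negatives, n_keep)
```

With 150,612 pairs, `split_into_folds` yields subsets of 30,122 or 30,123 pairs each.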
The implemented SVM algorithm was LIB-SVM (http://www.csie.ntu.edu.tw/~cjlin/). The applied kernel function was the radial basis function (RBF). The corresponding parameter settings of SVM learning were automatically optimized by LIB-SVM.
It is worth mentioning here that the predicted score for each protein pair can be regarded as a combination of the corresponding seven parameters with the assistance of SVM. Based on the predicted scores, the performance of TIM-Finder was assessed in the same way as we evaluated the individual descriptors.
Web server of TIM-Finder
To facilitate the community's research, a web server of TIM-Finder was constructed and is freely available at http://22.214.171.124/TIM-Finder/. To sufficiently represent the known structural TIM-barrel proteins as well as allow a reasonable computational time, the 322 TIM-barrel proteins in the SCOP_40 dataset were used as the library in the TIM-Finder system. To search a query sequence against the TIM-barrel library (i.e., SCOP_40_TIM), a total of 322 protein pairs are involved. For each protein pair, the corresponding seven parameters are calculated. Then, the resulting seven parameters are used as the input for the five SVM models trained in the above section, and the predicted score is obtained as an average value over the scores from the five different SVM models. Generally, the predicted score reflects the query sequence's probability of adopting a TIM-barrel fold. Finally, the predicted scores for all protein pairs are ranked, and the top 10 hits are reported. In the resulting page provided by TIM-Finder, the SCOP entry number, PDB link, prediction score, and the corresponding confidence level for each of the top 10 hits are listed. The whole process for each query normally takes about 10 minutes with a single processor on our Red Hat Enterprise Linux 5 system.
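The final scoring step of the server (averaging the per-model scores for each library entry, ranking, and reporting the top 10 hits) can be sketched as below. The dictionary layout and the function name are assumptions for illustration; computing the seven parameters and evaluating the SVM models are outside the scope of the sketch.

```python
def rank_library_hits(scores_by_entry, top_n=10):
    """Average the model scores for each library entry and return the top hits.

    scores_by_entry: dict mapping a library entry identifier to the list of
                     scores given by the individual SVM models for the pair
                     (query, entry); a hypothetical layout.
    """
    averaged = {
        entry: sum(scores) / len(scores)
        for entry, scores in scores_by_entry.items()
    }
    ranked = sorted(averaged.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]
```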
To provide confidence levels for different prediction scores resulting from TIM-Finder, a stringent negative dataset based on the SCOP_40 dataset was compiled. First, in the initial SCOP_40 dataset only the non-TIM-barrel proteins that belong to the α/β class (i.e., the same structural class as the TIM-barrel fold) were kept. Second, the proteins with a sequence length < 100 or > 1000 were removed. Third, the proteins that had been used in training TIM-Finder (i.e., the five SVM models) were further discarded. Finally, 1,999 non-TIM-barrel proteins were retained. We processed all 1,999 proteins on TIM-Finder, and it was estimated that a prediction score of 0.82 yields a ≤ 1% FPR (i.e., 99% confidence level) and a prediction score of 0.38 indicates a ≤ 5% FPR (i.e., 95% confidence level). Compared with proteins from other structural classes, query proteins belonging to the α/β class should have a higher probability of being predicted as TIM-barrel proteins. Selecting only α/β proteins as negative controls should therefore give a reliable estimate of the thresholds for the different confidence levels.
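One way to derive such thresholds from the scores of a negative dataset is sketched below: the threshold is taken as the lowest observed score at which the fraction of negatives scoring at or above it does not exceed the target FPR. This is an assumed procedure given for illustration; the text does not describe how the thresholds were computed, and the function name is hypothetical.

```python
from bisect import bisect_left


def threshold_for_fpr(negative_scores, max_fpr):
    """Lowest observed score whose false positive rate is <= max_fpr.

    A negative counts as a false positive when its score is >= threshold.
    Returns None if no observed score satisfies the requirement.
    """
    ordered = sorted(negative_scores)
    n = len(ordered)
    for t in ordered:
        false_positives = n - bisect_left(ordered, t)
        if false_positives / n <= max_fpr:
            return t
    return None
```

Calling the function with `max_fpr=0.01` and `max_fpr=0.05` on the scores of the negative proteins would give thresholds for the 99% and 95% confidence levels, respectively.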
Construction of the amino acid composition based SVM model
The AAC-based SVM model (i.e., AAC_SVM) was trained to distinguish TIM-barrel and non-TIM-barrel proteins. Briefly, the 163 TIM-barrel proteins in the SCOP_10_mod dataset were considered positive instances and their labels were set to +1, while the 843 non-TIM-barrel proteins were considered negative instances and their labels were set to -1. The AAC for each protein was used as the input feature vector. A 10-fold cross-validation was performed. We divided SCOP_10_mod into 10 roughly equal subsets. In each evaluation step, one subset was selected for testing, while the remaining nine subsets were merged into a training dataset. LIB-SVM with the RBF kernel was employed to train the SVM models, and the other SVM parameter settings were also automatically optimized by LIB-SVM. Based on the predicted SVM scores, AAC_SVM was assessed in the same way as TIM-Finder.
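The AAC feature vector can be sketched as the relative frequency of each of the 20 standard amino acids in a sequence. The ordering of the residues and the decision to ignore non-standard symbols (e.g., X) are assumptions; the text does not specify how such residues were treated.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, alphabetical by one-letter code


def amino_acid_composition(sequence):
    """Return a 20-dimensional vector of amino acid frequencies.

    Non-standard symbols are ignored, so the frequencies are normalized
    over the standard residues only.
    """
    sequence = sequence.upper()
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [count / total for count in counts]
```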
Proteome-wide TIM-barrel protein identification based on TIM-Finder, Fugue and PSI-BLAST
To benchmark the performance of TIM-Finder, Fugue, and PSI-BLAST, the proteome-wide TIM-barrel protein identification in B. subtilis was carried out. The whole proteome of B. subtilis, which contains 4,102 protein sequences, was obtained from ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria. The B. subtilis proteins with a sequence length < 100 or > 1000 amino acids were ruled out in our analysis, because they are less likely to be TIM-barrel proteins or are highly likely to contain more than one domain. Thus, 3,575 sequences were kept for further analysis.
TIM-Finder was run on these 3,575 sequences via the established TIM-Finder server. The stand-alone version of Fugue was provided by Dr. Kenji Mizuguchi (National Institute of Biomedical Innovation, Japan), and the corresponding fold library (i.e., the HOMSTRAD database, version of 05/2008), which consists of 4,026 representative protein structures, was downloaded from http://tardis.nibio.go.jp/homstrad/. The 3,575 protein sequences were processed by Fugue, and the top hits as well as the corresponding Z-scores were generated for each query sequence. As suggested by the Fugue developers, a Z-score = 6.0 corresponds to a 99% confidence level and a Z-score = 4.0 indicates a 95% confidence level. For comparison, the PSI-BLAST search was also performed on these 3,575 protein sequences. As in deriving the PSI-BLAST-based descriptor, each sequence was first searched against the NR90 database by PSI-BLAST for three rounds to generate a profile. Then a PSI-BLAST search was performed on the obtained profile against the SCOP_40_TIM sequences for another round and the top hit was recorded. Based on the same procedure as we used to define the confidence levels of TIM-Finder prediction scores, it was estimated that an e-value ≤ 0.009 means a 99% confidence level and an e-value ≤ 0.066 indicates a 95% confidence level.