Machine learning integration for predicting the effect of single amino acid substitutions on protein stability

Background Computational prediction of protein stability change due to single-site amino acid substitutions is of interest in protein design and analysis. We consider the following four ways to improve the performance of the currently available predictors: (1) We include additional sequence- and structure-based features, namely, the amino acid substitution likelihoods, the equilibrium fluctuations of the alpha- and beta-carbon atoms, and the packing density. (2) By implementing different machine learning integration approaches, we combine information from different features or representations. (3) We compare classification vs. regression methods to predict the sign vs. the output of stability change. (4) We allow a reject option for doubtful cases where the risk of misclassification is high. Results We investigate three different approaches: early, intermediate and late integration, which respectively combine features, kernels over feature subsets, and decisions. We perform simulations on two data sets: (1) S1615 is used in previous studies, (2) S2783 is the updated version (as of July 2, 2009) extracted also from ProTherm. For S1615 data set, our highest accuracy using both sequence and structure information is 0.842 on cross-validation and 0.904 on testing using early integration. Newly added features, namely, local compositional packing and the mobility extent of the mutated residues, improve accuracy significantly with intermediate integration. For S2783 data set, we also train regression methods to estimate not only the sign but also the amount of stability change and apply risk-based classification to reject when the learner has low confidence and the loss of misclassification is high. The highest accuracy is 0.835 on cross-validation and 0.832 on testing using only sequence information. The percentage of false positives can be decreased to less than 0.005 by rejecting 10 per cent using late integration. Conclusion We find that in both early and late integration, combining inputs or decisions is useful in increasing accuracy. Intermediate integration allows assessing the contributions of individual features by looking at the assigned weights. Overall accuracy of regression is not better than that of classification but it has less false positives, especially when combined with the reject option. The server for stability prediction for three integration approaches and the data sets are available at .


Background
In protein design and analysis, understanding the stability in sequence, structure, and function paradigms is of importance [1] and hence there is a need for predicting the protein stability change due to mutation. Single amino acid mutations can significantly change the stability of a protein structure [2]. To acquire a set of experimental annotations for every possible random mutation is combinatorial and requires significant resources and time. Thus, accurate computational prediction would be of use for suggesting the destructive mutations as well as the most favorable and stable novel protein sequences. To this end, the prediction of protein stability change due to amino acid substitutions remains a challenging task in the field of molecular biology.
Recent approaches fall into two major types: energy-based methods and machine learning approaches. Energy-based methods using physical, statistical, or empirical forcefields perform a direct computation of the magnitude of the relative change in the free energy [3][4][5][6][7][8]. Average assignment method [7] and different machine learning algorithms, such as support vector machines [2], neural networks [9], and decision trees [7] are trained on a data set to predict protein stability change. There are also hybrid approaches that combine energy-based and machine learning methods [10][11][12]; they basically generate the input features fed into machine learning algorithms using energy-based models.
One can predict the direction towards which the mutation shifts the stability of the protein (namely the sign of ΔΔG). It could be positive or negative, corresponding to an increase or decrease in stability, respectively. From a machine learning perspective, this is a binary classification task, where given x, information about the single-site amino acid substitution, the aim is to decide whether this is a positive or negative example, depending on whether the mutation is favorable or not. A third class of "doubt" can be defined for small changes that may be considered insignificant, and in such a case, one can train a three-class classifier [13] or a two-class classifier with the reject option.
Given a sample of n independent and identically distributed training instances, (x 1 , y 1 ),(x 2 , y 2 ), ...,(x n , y n ), where x i is the d-dimensional input vector coding the relevant information and y i ∈ {-1, +1} is its class label, i = 1, ..., n, a classifier estimates P(+|x) and assigns the test instance to the positive class if P(+|x) > 0.5, and to the negative class otherwise. There can be different representations in coding x. Deciding on the best data representation used is as important as selecting the classification algorithm.
Another possibility in solving this using machine learning is to define it as a regression problem with ΔΔG directly as the numeric output. One can then decide based on whether the prediction is positive or negative, and again predictions that are close to zero can be rejected if the risk of misclassification is high. No single machine learning algorithm nor representation, in classification or regression, induces always the most accurate learner in any domain. The usual approach is to try many and choose the one that performs the best on a separate validation set unused during training. Recently, it has been shown that accuracy may be improved by combining multiple learners [14,15]. There are three possible methods for combining multiple learners: early, late, and intermediate integration [16].
In early integration, inputs are concatenated as one large vector and a single learner (classifier or regressor) is used. In late integration, multiple classifiers/regressors are trained over different inputs and their decisions are combined by a trained learner. These two approaches can be applied with any classification/regression algorithm.
Late integration has been extensively used in bioinformatics. Weighted voting was used in classifier combination for protein fold recognition [17]. Majority voting was used for prediction of the drug resistance of HIV protease mutants [18], secondary structure prediction [19], detecting rare events in human genomic DNA [20] and identification of new tumor classes using gene expression profiles [21]. A trained combiner was used for secondary structure prediction [22,23]. A mixture of localized experts was used for gene identification [24]. Cascading, which is a multi-stage sequential combination method, was used for secondary structure prediction [25]. Support vector machines allow combination in a third way, using multiple kernels; this is also called intermediate integration [16]. Kernel functions basically measure similarity between data instances and a single learner can combine separate kernels for different data sources, instead of combining data before training a single learner (as in early integration) or combining decisions from multiple learners (as in late integration).
Intermediate integration was used for protein location prediction and protein function prediction tasks, respectively, by combining kernels applied to different representations such as protein sequences, hydropathy profile, protein interactions, and gene expressions [26,27]. This method is also used in glycan classification by combining different tree kernels [28].
Our work has four aspects: (1) Introduction of new protein residue features: The temperature factors of the back-bone and side-chain carbon atoms (B-factor) that reflect the thermal mobility/flexibility of the mutated residue; the local packing information in a higher resolution than that has previously been incorporated by considering the side-chain atoms as well; amino acid substitution likelihoods from PAM250 matrix. (2) Implementation of three different machine learning approaches (early, late, and intermediate integration), two of which, namely late and intermediate, have not been used before in the computational prediction of protein stability change. (3) Comparison of classification and regression methods. (4) The use of a reject option in both classification and regression to check for cases where the learner has low confidence.

Data Sets
The first data set (S1615) was compiled from the data available online [29], originally extracted [9] from the ProTherm database [30]. This data set has been used previously and provides a basis for comparison [2,9,31]. The set originally contains 1615 single-site mutation data from 42 different proteins. Each instance has the following features: PDB code of the protein, mutated position and mutation itself, solvent accessibility, pH value, temperature (T), and the change in the free energy, ΔΔG, due to a mutation in a single position. As there are instances for the same mutation and position where ΔΔG differs with T and pH values, T and pH are kept as features in our data set. A subset (388 instances) of the training set (1615 instances) was previously used as a test set for comparison between different predictors [2]. Though some studies include the test set also in the training set, we remove it from the training set to have disjoint training and test sets, as done in [2].
We also extract an up-to-date version (as of July 2, 2009) (S2783) that contains 2783 single-site mutations with known PDB code of the protein and ΔΔG values also from the ProTherm database. On this larger data set, we implement and compare both classification and regression integration methods and also their versions with the reject option.

Added Features
The substitution frequency of an amino acid for another is considered here as an additional feature with the Point Accepted Mutation (PAM) matrix [32]. PAM250 is chosen for the score of each amino acid substitution and is based on the frequency of that substitution in closely related proteins that have experienced a certain amount of evolutionary divergence.
Another feature considered is the mobility/flexibility of the amino acid position in a given structure. The B-factors reported in the PDB file is a good and quick indicator of this feature. Neighbors of the mutated residue in both amino acid sequence and 3D structure are the two other features that have been used recently [2,9]. A window size of seven in the sequence [2] and a cutoff distance of 9Å in space was previously used to find the neighbors of the mutated position as the optimum sequence length and distance, respectively [9]. In our implementation, in addition to alpha-carbon atoms (C α ), beta-carbon (C β ) atoms are also considered to reflect the packing at a relatively higher resolution.
A mutation in a position of a protein sequence will change the number of side-chain atoms of the residue in that position. This may trigger a conformational change or local readjustments that may result also in a change in the atomic packing around that residue and the fluctuations of the surrounding residues and the mutated residue itself. Nevertheless, as in other studies [2,9,31], we neglect this effect.
Removing the instances with non-available features and the redundant instances from S1615 leaves us with training and test sets of 1122 and 383 instances with total of 31 and 14 proteins. Stabilizing mutations are 32.35 per cent and 11.49 per cent, respectively. After removing the instances with non-available features, S2783 reduces to 2471 instances from 68 different proteins and 755 of them (30.55 per cent) are stabilizing mutations. Both data sets are available online. Table 1 gives a list of the representations, original features, and the new features that we introduce. The information coming only from the sequence (SO), and the topology of the protein structure (TO), and both (ST) are encoded in the same way as defined in previous studies [2]. An added asterisk, for example, (SO*), denotes the representation with newly added features. Neighbors of the mutated position in the sequence, mutation, T, and pH are encoded in SO/SO*. Sequence information is not used in TO/TO*; instead, spatial neighbors and the solvent accessibility of the mutated position are encoded. In ST/ST*, all information are combined. The substitution likelihood of an amino acid is added to the existing data as a new feature in all three representations. Crystallographic B-factors of the C α and C β atoms are used in TO* and ST*. For discrete features like amino acid identities, 1-of-n encoding is used, that is, if the variable can take one of n different values, one is set to 1 and all others to 0.

The Effect of Adding New Features to the Original Data Sets
To each of the three representations (SO, TO, or ST), the new features are added one at a time and as combinations of two and three (see Table 2). Since all the new features except PAM are structure-related, they are not added to SO. All of the new features, including PAM, are added to both TO and ST. We end up with (2 1 (PAM) for SO and 2 4 (PAM, CB, BFA, BFB) for each of TO and ST combinations) a total of 34 possible feature sets (all of which include the mutation, T, and pH information).

Performance Assessment
Having already left 383 test instances out as the test set for S1615, we use 20-fold cross-validation (cv) on the 1122 training instances using 19/20 = 95 per cent for training proper and 5 per cent for validation. The best cross-validation strategy, that is, the number of folds, gets the best trade-off between the total amount of computation and training set size. With k folds, one needs k sets of training and validation and uses (k -1)/k of data for training. We decided that the best is with k = 20; with higher k (or using jackknife), there is too much computation and with smaller k, training set gets small and variance increases. Classes should be represented in the right proportions when subsets of data are held out, not to disturb the class prior probabilities and we fulfill this requirement by stratification. Repeating training 20 times, we choose the hyperparameter set that has the highest average validation accuracy. The 20 classifiers trained on the 20 training folds for that hyperparameter set are tested on the test set. If we are required to perform classifier combination, we use the same training and validation sets also for the combiner due to the small size of the training set [33].
For all three integration methods, we use our own code; MOSEK [34] is used for solving the optimization problems of support vector machines. We report averages over 20 test results obtained by testing the trained classifier of each fold on the test set; for comparing classifiers, we use the paired t-test over these 20 results.
We use a slightly different methodology for S2783 because we train both classification and regression methods. First, we determine 3 split points for both stabilizing and destabilizing mutations as shown in Figure 1. Each split contains approximately the same number of data instances as the other two splits of the same class. This splitting mechanism both maintains stratification and ensures that we give the regressors training instances with diverse output values. Then, we take one-third of each split randomly to the test set and the remaining two-third is reserved as the training set. We apply 20-fold cv on the training set and obtain 20 folds. The learners (both classifiers and regressors) are trained on the 20 training folds and tested on the test set. The hyperparameter set that has the highest average validation accuracy for classification or the lowest mean square error for regression is selected and tested on the test set 20 times with the trained learners. This whole process is replicated 10 times each time using a different random test set. As a result, we obtain 10 × 20 test set results and report the average of these results.
The accuracies on the test set are calculated as given in Table 3 where TP, FP, TN, and FN, respectively refer to the number of true positives, false positives, true negatives, and false negatives. Precision, recall, and FP rate are evaluation measures which give information about the reliability of the predictor. The same measures are also reported for regression methods after converting the output of the regressor to a class prediction by looking at the sign.
As we can see from Figure 1, ΔΔG values are clustered around zero and small changes in the prediction of a learner may change the predicted label for a test instance. When the risk of misclassification is high, we can allow a predictor to give a reject decision. We define a risk matrix in Table 4 where r is the reject option, and the rows and

SO ± 3 neighbors (± 3 NE)
In all three representations, amino acid substitution likelihood is used as a feature. B-factors of the C α and C β atoms and spatial neighbor determined using both C α and C β atoms are features introduced into TO and ST. The abbreviations are given only for the features that we add.
columns correspond to the true and predicted class labels, respectively.
Predicting the class label correctly does not incur any cost at all. If the learner rejects, a unit cost incurs. If the learner makes a prediction error, it pays a misclassification cost λ for FN and αλ for FP where α is the trade-off parameter for FP and usually depends on the application. These misclassification costs should be larger than 1 in order to make the learner reject when it is not confident about its prediction. Given a risk matrix and P(+|x), we can calculate the risks of three possible actions as follows: and the best action is selected as the action with minimum risk. One can then solve for the rejection thresholds based on the values of λ and α. For example, if λ = 2 and α = 2, then we choose

# R e p r e s e n t a t i o n P AM
The new features to each of the three representations (SO, TO or ST) are added one at a time and as combinations of two and more. The original features are already given in Table 1 and are not shown here.
Distribution of S2783 data over the free energy change due to single-site mutation, ΔΔG Figure 1 Distribution of S2783 data over the free energy change due to single-site mutation, ΔΔG. The regions separated by dashed lines are used to obtain similar training and test splits. Random one-third of the instances in each region is reserved for testing and the remaining two-third is used in training.  For regression where the output is not a probability but a number, we can not analytically solve for the two thresholds but need to do an exhaustive search. We search for two thresholds θ 1 (< 0) and θ 2 (> 0) on the validation sets given the values of λ and α that minimize the total classification risk. We choose the negative class if the regression output, y, for a specific test instance is less than θ 1 , reject if θ 1 < y <θ 2 , and choose the positive class if y >θ 2 .

Early Integration
Different classifiers make different assumptions about the data and may fail in different instances [14]. We train three classifiers, namely k-nearest neighbor estimator, decision tree, and support vector machine, using SO/SO*, TO/TO*, and ST/ST* representations. We use a single regression method, namely support vector regression, on all representations.

k-Nearest Neighbor (k-NN) Classifier
The k-NN classifier assigns the input to the class by taking a majority vote among its k neighbors. The best value of k is chosen from the set of 1, 3, 5, 7, 9, and 11 using 20-fold cv.

Decision Tree (DT)
A DT is a hierarchical model whereby the local region is identified in a sequence of recursive splits. When there is noise, growing the tree until it is purest, we may grow a very large tree. To alleviate such overfitting, tree construction ends when nodes contain few examples; this threshold, τ, is the hyperparameter to be tuned. τ parameter is selected from the trial values of 56 (5 per cent of the training set), 28, and 14 for S1615 (80, 40, and 20 for S2783).

Support Vector Machine (SVM)
SVM finds the linear discriminant in the feature space with the maximum margin [35]. SVM uses the training data in the form of dot products and allows embedding another feature space via kernel functions. The RBF (radial basis function) kernel was recently reported to work best for stability prediction [2]. The regularization parameter, C, is chosen from (0.01, 0.1, 1, 10, 100) and the kernel width, γ, is chosen from (0.25r, 0.5r, r, 2r, and 4r) where r is the average nearest neighbor distance over the training set.

Support Vector Regression (SVR)
SVR is an extension to SVM for regression problems [36]. The regularization parameter, C, is chosen from (0.01, 0.1, 1, 10, 100) and the width parameter of the RBF kernel, γ, is chosen from (0.25r, 0.5r, r, 2r, and 4r) where r is the average nearest neighbor distance over the training set, the regression tube width, , is selected from (0.05, 0.10, 0.15).

Late Integration
It is possible to learn to combine the decisions of classifiers by a combiner classifier. By training the three classifiers described above with 34 data sets (see Table 2), we get 102 different combination triplets of (R.D.B) outputs where R, D, and B stand for Representation, Data set, and Baselearner. The output of the combiner is the best subset combination of these 102 triplets. The two criteria to select the best combination are accuracy and diversity, in that, we want (R.D.B) triplets that fail in different regions of the input space. In order to see to what extent any two classifiers are correlated, McNemar's test is used [15]. The same procedure can also be applied to combine regressors. We obtain 34 different regressors and the combiner chooses a subset from those. The correlation coefficient between the output of two regressors can be used to check the diversity between these regressors; a small correlation coefficient means that the two regressors are diverse.
The algorithm for selecting the most accurate and most diverse (R.D.B) triplets is a greedy, forward algorithm for subset selection. We start with an initial (R.D.B) that is the most accurate and search through the rest of the (R.D.B) triplets for those that are different from the initial one at significance level of α = 0.05 by McNemar's test. We add the most accurate one among those and iterate until there is no further improvement. The posterior probability outputs of the selected classifiers are then used to train a combiner that is an SVM with the linear kernel. The pseudocode of the algorithm is given in Table 5. The algorithm for combining regressors is very similar to Table 5 except three basic differences: (1) We select the regressor with the minimum mean squared error among candidate regressors. (2) We use correlation coefficient as the diversity measure between regressors. ( | ) . puts of selected regressors with a combiner that is an SVR with the linear kernel.

Intermediate Integration
When using multiple kernels in support vector machines, there are two different possibilities [16]: We can calculate kernel functions on different representations or calculate different kernel functions on the same representation.
One can take a sum over different kernels and summation rule is applied successfully in computational biology [37] where heterogeneous data sets exist by the nature of the biological problems.
Replacing the kernel function with a weighted summation of p kernel functions was proposed [38,39]: where the combination weights (η m ) are new parameters optimized in training. In addition to the flexibility of constructing weighted combination rules, using multikernel SVMs provides two important advantages: (1) Information can be extracted about the classification task at hand. The feature sets used in kernel functions with larger weights are understood to give more relevant information in terms of classification. For example, obtaining information about important features in biological problems such as disease diagnosis and drug development is as important as classification accuracy.
(2) Kernel functions with zero weights can be eliminated. If such feature sets are obtained by using costly and time consuming experimental procedures, this decreases the overall complexity and cost.
For regression using intermediate integration, we use a variant of the localized multiple kernel learning model [40]. Kernel combination weights can be modeled by using the softmax function as follows: where the softmax guarantees that η m ≥ 0 and , and u m are the kernel-specific parameters we need to learn. These parameters are optimized during training in an iterative manner.
In intermediate integration, we combine RBF kernels over feature subsets that form SO/SO*, TO/TO*, and ST/ST*. Their width parameters are selected as the average nearest neighbor distances in the corresponding feature subsets.

S1615 Data Set Early Integration
We finetune the hyperparameters by inspecting the 20fold cv misclassification error.   extra features for early integration are given in Table 6. Table 7 lists the precision, recall, and FP rate values on the test set for the best classifiers for all three representations.

Late Integration
For k-NN, we choose k = 5 to give more informative posterior probabilities, rather than 0/1 decisions, to the combiner in late integration.  Table 5. When the outputs of the (R.D.B) triplets are given to the SVM Combiner, the average accuracy is 0.903 on the test set and 0.847 on the validation set (see Table 8). This accuracy is comparable to the reported values in previous studies [2,7,31]. Similarities between selected (R.D.B) triplets calculated by McNemar's test are given in Table 9.

Intermediate Integration
The test results for all data representations are given in Table 10. We can see that adding PAM to SO does not change the accuracy because PAM is assigned zero weight; but adding extra features to TO and ST increase the average accuracy by 4.6 per cent and 6.0 per cent, respectively; both improvements are statistically significant. The highest accuracy is obtained with TO* (0.879), which however is significantly less than 0.909 of early integration.
The kernel weights can be used to assess the relative importance of features (see Table 11). In all three repre-  Accuracy of the best (R.D.B) triplets in early integration for each representation of the S1615 data set  sentations, each feature subset except PAM has a combination weight (η m ) greater than zero. The original representations have meaningful features for classification. The weights also show that ± 3 neighbors in the sequence carry as much information as ± 1 and ± 2 neighbors. In the modified representations (SO*, TO*, and ST*), the new weights indicate that the added features, except PAM, carry information for the stability of a protein.
Local spatial composition with C α and C β (CB) has larger weight than C α (CA), which highlights the contribution of side-chain packing to stability. Also, the information that reflects the extent of mobility/flexibility of each C α (BFA) and C β (BFB) have nonzero weights, implying that they are informative.

Overall Comparison of Integration Methods
To be able to compare the three integration methods, in all three representations, we chose the version that has the highest average validation accuracy and compared the three. The ones chosen are given in Table 12 that shows the averages and standard deviations of validation and test accuracies. According to 20-fold paired t-test on the test results, there is no significant difference between early and late integration; both are significantly more accurate than intermediate integration.  Table 13. We see that SVM and SVR clearly outperform k-NN and DT by improving accuracy more than 1.5 per cent in all three representations. When we look at the effect of adding the new features to the original representations for SVM and SVR, we see that the new features do not change the test accuracy very much. The precision, recall, and FP rate values on the test set are also listed for SVM and SVR in Table 14, where we see that though SVM and SVR have comparable accuracies, SVR almost halves the FP rate, for example on ST*, it reduces from 0.078 to 0.040.

Late Integration
First, 102 classifiers trained on S2873 data set are combined with the procedure explained in Table 5. We obtain the average accuracy as 0.832 on the test set and 0.830 on the validation set (see Table 15). Then, we combine 34 regressors trained, the average test set accuracy is 0.827 and the average validation accuracy is 0.819. Again, we see    that in terms of accuracy, SVM and SVR are comparable, though the latter has higher precision and lower FP rate.

Intermediate Integration
The test results for all data representations using multikernel SVM and SVR are given in Table 16. When we use multikernel SVM, we can see that adding extra features does not change accuracy. The highest accuracy is obtained with ST (0.806), which however is less than 0.832 of early integration. Using extra features in multikernel SVR does not help increase the accuracy either. The best accuracy performance is obtained with TO as 0.797.
When we look at Tables 17 and 18, we can say that the added features carry information for predicting the energy change for single-site mutations even though they do not improve the average testing accuracy. As in S1615 data set, local spatial composition with C α and C β (CB) has larger weight than C α (CA) and the information that reflects the extent of mobility/flexibility of each C α (BFA) and C β (BFB) has nonzero weights.

Classification with Reject Option
We also perform simulations with reject option both for classification and regression, and give the performance measures obtained with early integration using SO (see Tables 19 and 20), late integration (see Tables 21 and 22), and intermediate integration using TO* (see Tables 23 and 24), respectively. We see that increasing λ and α values increases the accuracy of predictors and decreases FP rate at the cost of rejecting some instances. The selection of λ and α values is of crucial importance and depends on the loss incurred for making wrong decisions. Figures 3  and 4 show FP rate and rejection rate values for all integration approaches using SVM and SVR with the tried (λ, α) pairs. We see that using late integration for SVM case generally gives lower rejection rate than early and intermediate integration for a given FP rate; SVR can attain much lower FP rate but needs to reject more.

Discussion
We focus on the protein stability change prediction by adding new features and implementing the three different integration approaches, classification vs. regression, the effect of the reject option.

Sufficiency of the Data Sets
Training any classifier with an unbalanced data set in favor of negative instances makes it difficult to learn the positive instances. The unbalanced distribution in prior probabilities of the two classes in both training and test sets affects the reliability of the predictor in all integration approaches. Nevertheless, the abundance of one class remains with the nature of the stability problem. Stabilizing mutations are far less than destabilizing mutations. Higher accuracies might be achieved with balanced training and test tests. For example, the test sets of S1615 and S2783 data sets have 88.51 per cent and 69.45 per cent destabilizing mutations, respectively. S1615 data set does not have balanced training and test sets whereas we evenly distribute stabilizing and destabilizing mutations to training and test sets for S2783 data set. For S1615 dataset, we     [9]. Even though the ΔΔG values are not provided to the classification algorithm numerically, the error rate is higher for smaller changes and lower for larger ones. This may be due to two reasons: Either our predictor works best at dramatic stability changes; or possible experimental errors, being more significant for smaller ΔΔG values than the larger ones, confused our predictor. In separating the mutations into two distinct classes as positive and negative, the prediction may be ambiguous for data points close to zero. If we test our best classifier for S1615 data set with the test instances outside of this interval (230 of 383 instances), we obtain 0.948 test accuracy. This last result shows the advantage of introducing a reject option and the approach we use by taking into account the losses of rejects and wrong decisions is the systematic way to choose the optimal thresholds.
Furthermore, the mutations in the test set of S1615 data set were conducted in physiological conditions [2], having T in the range 20-30°C and pH in the range 6-8 whereas for the training set, the ranges are 0-86°C and 1-11 respectively. It is not ideal to train a learner with data within a wide range and test it only in a limited region; it is normally expected that the training and test sets follow the same probability distribution. In S2783 data set, the test data and the training data are split randomly to alleviate this problem. Because we do the splitting ten times and take the average, our results are more robust on S2783 data set.

Integration Approaches
The most accurate predictor in early integration for S1615 (S2783) data set is SVM (SVM) classifier trained with ST* (SO) achieving a validation set accuracy of 0.842 (0.829) and a test accuracy of 0.904 (0.832). We see in Tables 6 and 13 that using structural information is useful with k-NN and DT; adding new features such as PAM and CB improve cv accuracy, and in the case of TO*, also improves test accuracy using SVM, though not significantly. It may be said that TO does not have enough packing information intrinsically and using B-factors and C β may help.
In late integration for S1615 data set, of the four triplets combined, two are SVM, one is DT and one is k-NN. The fact that four different learners are chosen show that the learning algorithm is a good source of diversity. Of the four, two use ST* and SO*, showing again that in terms of representations, there is also diversity for higher accuracy. Note that this diverse set is found automatically by the selection algorithm we use.
The most accurate intermediate integration version for S1615 data set uses TO* with all new features; its test accuracy is 0.879, which is significantly more accurate than the version with old features only (TO) with test accuracy 0.833. Though it is not as accurate as the other integration methods, intermediate integration has the advantage of knowledge extraction through weights assigned to features. The kernel weights (see Tables 11,13,17,and 18) show that when the protein structure is available, CA and CB are always preferred as a more valuable information source than any other features including sequence neighbors. Based on the kernel weights, we can say that stability change is mostly a structure-driven phenomenon: For example, when we sum up the weights of structural fea-

Prediction Using Only the Amino Acid Sequence
We analyze simulation results to see how accuracy changes if we have only the sequence information. For both data sets, the best performance in early integration is obtained with (SO. ORIGINAL.SVM). The average test accuracies are 0.904 and 0.832 for S1615 and S2783 data sets, respectively. Intermediate integration for S1615 data set achieves 0.872 average testing accuracy with SO, which is higher than those of TO and ST (0.833 and 0.818, respectively). With the extra features, the accuracies are 0.872, 0.879, and 0.878 for SO*, TO*, and ST*, respectively (see Table 10). Distribution of the correctly classified (grey) and misclassified (black) instances of S1615 data set after the SVM combiner over the free energy change due to single-site mutation, ΔΔG  Tables 19, 21, and 23). Another target can be achieving a specific FP rate. In this case, for example, early and late integration reject 10 per cent of the test instances and intermediate integration rejects 35 per cent of the test instances to get a FP rate less than 0.05. The same behavior can also be observed for regression (see Tables 20,22,and 24).

Comparison with Other Studies
Our methodology using 20-fold cv has comparable accuracy to previous studies [2,7,9,41]. S1615 data set is based on Protherm that has been also used by those studies. Nevertheless, it is not exactly the same data set as we remove the test set from the training set, thus we represent our comparison with this caveat in Table 25. Early integration approach is used in all referred works. They all report the performance of their predictors based on k-fold cv, also including the test set in cross-validation. The highest accuracy reported so far is 0.930 evaluated on a subset of the training data [9]; our early integration has the accuracy of 0.904 on the independent test set. In those studies, higher accuracies are reported in the presence of structural information, which is in agreement with our findings though the difference is not significant in our case. Ours is the first study that compares early, intermediate, and late integration to incorporate knowledge from different data sources for the problem of predicting protein stability, also analyzing the effect of different types of sequence and structural information.

Conclusion
In protein stability prediction, we investigate three approaches for combining multiple representations/ learners, namely, early, intermediate, and late integration. These approaches can be used in both classification and regression. Early integration uses a single learner to combine multiple inputs whereas late integration combines the decisions of learners using different inputs. Intermediate integration combines inputs at the kernel level. We find that early and late integration are significantly more accurate than intermediate integration and intermediate integration allows knowledge extraction in the sense that it can pinpoint the features that are useful and how much they contribute. One advantage of combination is that if a new feature set, a kernel or a method for learning is proposed (using machine learning or some other approach), it is always possible to include it among the set we use and thereby improve accuracy even further.
In general, we would expect early integration to suffer more from the curse of dimensionality when many input sources are concatenated. Late integration combines decisions and therefore is expected to be more robust; the disadvantage would be the need to train/store/use multiple learners. Intermediate integration is in between these two extremes where separate features are not used in a raw manner (as in early integration) nor are decisions extracted from them (as in late integration) but are converted to similarities (using kernels) and fed to a single learner. The relative weights of features can be measured using intermediate integration. Of course, ours is a single  [41]. † A subset of the training set that was previously used in training. ‡ Filtered from the set of 1615 mutations [9]. Machine learning method, data set, performance assessment are the main features to be compared. (Seq: Sequence-based information, Seq+Str: Sequence-and structure-based information) study and further research is needed before one can explain with enough confidence where and why each integration method works the best. Of the three which one should be chosen depends on the application and other criteria, such as how much time and space can be afforded.
We see that in terms of accuracy there is no significant difference between interpreting this as a classification or regression problem except that a regressor tends to have a lower FP rate. We also conclude that introducing a reject option is useful to reject cases where a classifier or a regressor is not confident; it allows achieving a much lower FP rate taking into account the loss incurred for rejections and misclassifications.
As a future direction, we can add features, for example, to reflect the side chain conformation change due to a singlesite mutation by a simple modeling.