- Research article
- Open Access
Prediction of protein motions from amino acid sequence and its application to protein-protein interaction
BMC Structural Biology volume 10, Article number: 20 (2010)
Structural flexibility is an important characteristic of proteins because it is often associated with their function. The movement of a polypeptide segment in a protein can be broken down into two types of motions: internal and external ones. The former is deformation of the segment itself, but the latter involves only rotational and translational motions as a rigid body. Normal Model Analysis (NMA) can derive these two motions, but its application remains limited because it necessitates the gathering of complete structural information.
In this work, we present a novel method for predicting two kinds of protein motions in ordered structures. The prediction uses only information from the amino acid sequence. We prepared a dataset of the internal and external motions of segments in many proteins by application of NMA. Subsequently, we analyzed the relation between thermal motion assessed from X-ray crystallographic B-factor and internal/external motions calculated by NMA. Results show that attributes of amino acids related to the internal motion have different features from those related to the B-factors, although those related to the external motion are correlated strongly with the B-factors. Next, we developed a method to predict internal and external motions from amino acid sequences based on the Random Forest algorithm. The proposed method uses information associated with adjacent amino acid residues and secondary structures predicted from the amino acid sequence. The proposed method exhibited moderate correlation between predicted internal and external motions with those calculated by NMA. It has the highest prediction accuracy compared to a naïve model and three published predictors.
Finally, we applied the proposed method predicting the internal motion to a set of 20 proteins that undergo large conformational change upon protein-protein interaction. Results show significant overlaps between the predicted high internal motion regions and the observed conformational change regions.
A protein molecule is not a rigid body. The scale of protein motions is very broad: motions range from local fluctuations such as those seen in loop regions to global ones involving changes in the relative position of rigid domains. Flexible regions and linkers connecting rigid regions are often observed in large proteins. Flexible regions are often necessary for proteins to perform their specific biological functions [1–4], e.g. by enabling proteins to adjust their conformations in response to external stimulation. Such stimuli can include the binding of a ligand or a change of the surrounding environment. Structural flexibility is therefore an important characteristic that must be examined to understand proteins.
When we specifically examine motions of a protein backbone segment in ordered structures, the movement is theoretically classified into two forms: internal and external motion . The former is a deformation of the segment itself, but the latter involves only translational and rotational motions of the segment. In the external motion, the segment fluctuates as a rigid body by changing dihedral angles of the flanking residues. For this reason, it is considered that the internal and external motions fundamentally differ (Additional file 1: Figure S1). It is expected that the distinction between these motions will provide new insights into the relation between structural flexibility and its function .
Actually, NMR provides a powerful experimental technique to analyze protein dynamics at the atomic and molecular levels . Particularly, NOEs and relaxation experiments provide information related to picosecond-microsecond motions of the backbone atoms [8–10]. Model-free analysis enables quantitative determination of fluctuation and slow conformational change (i.e. millisecond order) of the backbone amide vector [11, 12]. The latter motion is assumed to be related to internal motion, as described above. Although NMR provides a detailed view of protein dynamics, it is time-consuming.
In contrast, computational methods are useful to calculate the dynamics of proteins for which structures are available. One method is to compare structures of a protein crystallized under different conditions or different conformers of NMR. Structural differences show a flexible region [13–15]. Another computational method is to simulate protein dynamics. Among several methods, Normal Mode Analysis (NMA) provides a straightforward means of calculating the dynamics from its structure. Although NMA is less CPU-intensive than other computer simulation methods such as Molecular Dynamics (MD), Monte-Carlo (MC) simulation, and Framework Rigidity Optimized Dynamics Algorithm (FRODA)/Floppy Inclusions and Rigid Substructure Topography (FIRST) software [16, 17], it can detect concerted motions of clusters of atoms and support discussion of their motions for elucidation of their functions [18–21]. Using NMA results, Nishikawa and Go examined internal and external motions of secondary structure , and Ishida et al. studied a subtilisin-eglin c complex to explore internal and external motions of enzymes and inhibitors .
With the increasing number of available protein structures and the development of high-performance computers, databases of protein dynamics have been constructed. In fact, i GNM  and ProMode  are databases of protein motion analyzed that respectively use a Gaussian Network Model (GNM) and NMA. Another database, MolMovDB , presents numerous graphical representations including motions of loops, domains, and subunits. In addition, DynDom  provides domain, hinge axes, and hinge bending residues in proteins determined from two different conformations of the same protein.
Recently, web-based tools for predicting internal motion have been developed. For example, FlexOracle  and HingeProt  predict hinge regions in proteins. Furthermore, DFprot  predicts main-chain deformability, which corresponds fundamentally to the internal motions described above.
However, all these tools, along with NMA, offer only limited practical use because they require knowledge of the three-dimensional structure information of a protein. Recently, a few prediction techniques that address protein motions using only amino acid sequence information have been proposed. Of those, ASP  and the Protein Continuum Secondary Structure Predictor  identify conformational switches in proteins using secondary structure information. FlexPred also predicts ordered conformational change in the protein backbone using information of sequence neighbors, evolutionary conservation, and solvent accessibility [31, 32]. These definitions of protein motions are similar to those of internal motion, as described above. The support vector machine-based predictor, Wiggle, predicts functionally flexible regions defined using a coarse-grained-protein dynamic modeling approach .
In this paper, we present a novel method for predicting internal and external motion in ordered structures. The proposed method is based on the Random Forest (RF) algorithm using information associated with the adjacent paired amino acid residues and a predicted secondary structure. The method presents the advantage of enabling prediction of protein motions using amino acid sequence information alone as the input. The proposed method exhibits moderate correlation between predicted internal and external motions with those calculated by NMA: the respective correlation coefficients are 0.525 and 0.597. To investigate the possibility that the proposed method detects flexible regions related with protein function, we applied it to 20 proteins that undergo large conformational change upon protein-protein interaction. The results revealed, in 85% of the proteins studied, overlaps between the predicted high internal motion region and observed conformational change region.
Results and Discussion
Herein, we describe our demonstration of the relation between B-factor derived from X-ray crystallographic studies and internal/external motion. We then present the proposed algorithm and the experimental evaluation. Finally, we applied the proposed method to a set of 20 proteins that change their conformations when interacting with other molecules.
Thermal motion and internal/external motion
The B-factor determined in X-ray crystallographic studies is often used as an indicator of thermal motion. However, B-factors include both thermal motions and static deformation attributable to crystal packing and other causes. We first analyze the relation between thermal motions assessed according to the X-ray crystallographic B-factor and internal/external motions determined computationally using NMA. To this end, we calculated the correlation between the amino acid frequencies with Z-scores higher than one (i.e., residues with large motions; Figure 1). We observed that, although the overall correlation coefficients between motions as assessed using the B-factor and internal and external motions were 0.769 and 0.945, respectively, the frequencies of several amino acids differed considerably between the B-factor and both internal and external motions. Comparison of internal motion and B-factors using a t-test (p < 0.01 ) shows that the frequencies of charged amino acids E, D, K, and R are low in internal motion, whereas those of bulky or hydrophobic amino acids V, I, F, Y, and L are high (Figure 1B). Therefore, the thermal motion, as assessed using the B-factor, and internal motion have some definite discrepancies. A similar trend was apparent for the external motion and B-factor, but the difference is smaller (Figure 1C). From comparison of the actual NMA values, the external motion is shown to resemble the thermal motions.
Length distribution of regions with a high internal/external motion score
We investigated the length distribution of consecutive amino acids with normalized internal and external motion scores calculated using NMA larger than one (Figure 2). For both motions, the frequencies of high-mobility regions decreased as the region length increased. Results showed 3 and 19 regions longer than 21 residues long, respectively, with high internal and external motions. We noted a peak in the distribution of external motion at around nine. This observation suggests that the external motion included the short segment, which fluctuates as a rigid body such as a short helix (Additional file 1: Figure S1). According to these observations, we created two kinds of predictive models for external motion, although only one prediction model was developed for internal motion. For external motion, one model (external_short) used short flexible regions (≤ 9 residue length) as the dataset; the other model (external_long) used longer flexible regions. The final prediction result of external motion is obtained from their combination (details are presented in the Methods section).
Construction of prediction method
We developed a prediction method for predicting internal and external motion based on Random Forest (RF) , which is a kind of supervised classification algorithm. The proposed method uses information about properties of the local sequence neighborhood and predicted secondary structure to predict the degrees of protein motion for a given sequence. Therefore, it is necessary to determine an optimal size of the local sequence neighborhood and suitable prediction methods that assign residues to the secondary structure (ss) and the accessible surface area (ASA). In this study, we tested two types of predictors. Amino-acid propensity based predictors constitute one type: PHD  and RVPnet  were chosen, respectively, for predicting ss and ASA. The others are profile-based predictors: psipred  and sable . To assess the prediction accuracy of the proposed methods, three criteria were chosen: the mean absolute error (MAE), the correlation coefficient between prediction scores and normalized NMA scores (CC), and the area under the ROC curve (AUC). Their respective details are discussed in the Methods section.
We first investigated the influence of the window size and two structural information predictors on prediction accuracy (Figure 3). For internal motion, the largest gain in performance was observed when the window size was 11 residues. The optimized window sizes were the same even if structural information predictors differed. However, differences were observed between the prediction accuracies of the two methods. The prediction performance was higher when the profile-based predictors were used. In contrast, the best performance for predicting external motion was observed when the window size was 17 residues, which is larger than the window size of internal motion.
We compared the prediction score with normalized NMA score to assess the predictive performance. In general, protein motion for each residue tends to be large in a coil or loop region and small in a secondary structure. Based on this concept, we created a naïve model, which predicts the score that reflects the degrees of protein motions. In the naïve models, first, the secondary structure was predicted using PHD or psipred. Then, the magnitude of protein motion for each residue was calculated as follows. If the i-th residue is located in the secondary structure (ss), then
where avg. and SD respectively signify the average and standard deviation of the normalized NMA score in the whole dataset. Actually, 6.80e-6 and 7.49e-3 were used, respectively for avg. of internal and external motion; 1.00 and 1.01 were used for SD, respectively. The other region has no secondary structures.
We also performed comparison with three published methods that predict a region with protein motion, although their definitions of protein motion differed. First, we chose the B-factor predictor because the amino acid frequencies of the external motion and the B-factor are similar (Figure 1). In this case, we selected PROFbval, which predicts normalized backbone B-values . Second, we selected disordered region, which are defined as a region lacking a stable three-dimensional structure. Although it differs from internal/external motion in terms of lacking ordered structures, they are considered to possess high flexibility. Consequently, the comparison of the proposed method with disordered region predictors is worth investigating. For predicting disordered regions, we used POODLE-S  in this study. Third, we specifically examined a region that is involved in a conformational switch. These regions can switch from one folded conformation to another, which is similar to the definition of internal motion in this study. To predict them, we chose FlexPred , which predicts residue positions that might be involved in conformational switches in ordered structures. Three predictors use only amino acid sequence information for prediction.
The prediction results for three proteins are presented in Figure 4. For aminoglycoside 6'-N-acetyltransferase (Figure 4A) and pheromone-binding protein (Figure 4B), some high-mobility regions can be predicted correctly. However, some peaks were predicted incorrectly for tachylectin-2 (Figure 4C). Next, the MAE and CC of two proposed predictors were estimated by performing five-fold cross-validation tests. The proposed methods, which implemented psipred and sable, yielded the lowest scores in the averages of MAE s and the highest scores in the average of CC s among all prediction methods for both motions (Table 1). Furthermore, in three kinds of AUC that changed the threshold value, the proposed methods exhibited the best performance among them except for the threshold value of -1 for external motion (Table 1 and Figure 5). Although the naïve model assigns the same high scores equally to residues located in all loop regions, the proposed method assigns different scores to residues with dynamics.
Therefore, the proposed method exhibited higher performance than the naïve model. Both PROFbval and POODLE-S showed higher prediction accuracy for external motion than for internal motion. In addition, the respective differences of the prediction accuracy between the proposed method for predicting external motion, and PROFbval and POODLE-S are smaller than the difference between the proposed method for predicting internal motion, and PROFbval and POODLE-S. Those results indicate that the character of the B-factor and disordered regions resembles that of external motion, as discussed above. It is noteworthy that the distribution of CC varied widely (Figure 6), ranging for internal and external motion from -0.185 to 0.865 and from -0.478 to 0.905, respectively. Although both MAE and CC increased roughly in relation to the margin size (Additional file 2: Figure S2), similar conclusions can be obtained for any margin.
Relation between prediction accuracy and structure classification
In this study, we developed a novel method for predicting internal and external motion in ordered structures solely using amino acid sequence information. Although the respective performances of the proposed methods were higher than those of naïve model and three kinds of published predictors, some room exists for improving their prediction accuracies because each standard deviation of their CC is high (Table 1). To elucidate the cause, we investigated the prediction accuracy according to four structural groups defined by SCOP hierarchy: all-α protein, all-β protein, α/β protein, and α + β protein (Figure 7). For internal motion, the all-α protein group exhibited the highest score in the average of CC s among the four groups, whereas the all-β protein group was the lowest. Similarly, the average of MAE s for all-β protein exhibited the highest value, which indicates poor predictive ability for its group. The prediction accuracy differed significantly among the combination of all groups and all β protein group, according to results of a Steel-Dwass test (p < 0.01). From the viewpoint of secondary structure, a similar tendency was observed: the average of MAE s for β-sheet is higher than that for α-helix (Additional file 3: Table S1). Therefore, the prediction result for internal motion in the proposed method was not good for tachylectin-2, which has an all-β structure (Figure 4C). On the other hand, no significant difference was found in the prediction accuracy for external motion among the four groups. Cases for which the proposed method incorrectly predicted high scores in β-sheet were often observed when predicting internal motion for all-β proteins. We use sliding windows with lengths of 11 and 17 residues, respectively, for predicting internal and external motion. Therefore, information about relations with two amino acids in remote positions of the sequence was ignored in this study. Consequently, it is readily conceivable that an amino acid positioned in β-sheet does not include sufficient information to represent its state in a three-dimensional structure in comparison with that positioned in α-helix because two remote amino acids of a sequence often interact in the β-sheet. Internal motion is deformation of the segment. Therefore, it is considered that the motion is influenced by the environment in its structure. The information of remote amino acids of a sequence is therefore necessary along with information of an adjacent amino acid of a sequence for predicting internal motion. In contrast, information of an adjacent amino acid of a sequence is thought to be more important for predicting external motion because external motion involves translational and rotational motion as a rigid body by the flanking deformed residue. Therefore, the prediction accuracy of internal motion for all-β proteins is thought to be low, although no difference is found in the prediction accuracy of external motion.
Features related with protein motion
The features associated with prediction of internal and external motion in the proposed method can be evaluated according to their influence on prediction accuracy for a decreasing number of variables involved in prediction models. The RF can estimate the importance of variables more simply than commonly used machine learning methods such as SVM . We investigated the relation between the prediction accuracy and the model with fewer variables using the Gini index as an indicator (Additional file 4: Figure S3).
Then we chose and analyzed the model with the minimum number of variables among models showing almost equal prediction accuracy to that of the model with all variables, called the original model (Figure 8). From comparison with the original model, most variables relating to structural information, namely, secondary structure and ASA, remained in all models. The ratio that they occupied in their model was higher than that in the original model. When we observed features that ranked in the top 20 based on the Gini index, ASA occupied over half of features in five of six models (Additional file 5: Figure S4). On the other hand, the ratios of variables relating to amino acid properties, namely physicochemical property (physicochem.) and protein mobility propensity (mobility), were low and different, as judged using the downsized model. These results show that the structure information indicating where an amino acid is located in its structure influences CS for both motions especially. However, structural information alone is insufficient to discriminate protein motion; PS for both motions and RS for external motion require information related to amino acid properties.
These observations showed that two kinds of information--that related to secondary structure and ASA--are highly associated with two protein motions. Therefore, when a more sophisticated secondary structure and ASA predictor, such as a profile-based predictor (psipred and sable), were used in the proposed method, the prediction accuracies of these versions were slightly higher than those of versions using a amino-acid propensity based predictor (PHD and RVPnet) (Table 1).
Application to protein-protein interaction
Although we aim to predict internal and external motion in ordered structures in this study, as one attempt to investigate the potential for the proposed method to detect protein motion associated with protein function, we compared the predicted high protein motion region with the observed large conformation change region upon protein-protein interaction. These protein motions are regarded as associated with their protein function because they are observed along with protein-protein interaction. For the experiment, we prepared a set of 20 proteins that undergo large conformation change upon association (> 2Å C α RMSD) created by Dobbins et al., with which they demonstrated the relation between normal mode fluctuations and conformational change  (Table 2). We compared the internal motion with observed conformational change region because it was defined as a deformation of a segment itself in this study. To begin with, we present three kinds of typical analysis results. Specifically, the observed conformation change regions are located in a binding site, hinge region, and other regions. Secondly, we discuss the overall results.
Ecotin, a homodimeric protein, is an inhibitor of a group of homologous serine proteases such as trypsin, chymotrypsin, and elastase. One dimeric inhibitor binds to a protease molecule. From comparison of two structures under different crystalline environments, an inherent flexible loop was identified in the binding site with trypsin. It was necessary for its inhibitory function . The proposed method predicted high internal motion on the corresponding loop (Figure 9A). FlexPred predicted a specific conformational switch region on it (Additional file 6: Figure S5A).
(II) Fab fragment
Fragment antigen binding (Fab fragment) is a region on an antibody that binds to antigens. It is a heterodimer of the heavy and light chains in each of the two composed domains. When Fab binds to hemagglutinin derived from a Flu virus, it was observed that the hinge region between two domains changed their conformation. That hinge movement agreed with DynDom definition, which determines the dynamic domain and hinge axes from two protein structures . The proposed method predicted high internal motion at the hinge region in each chain (Figure 9B). PROFbval predicted high B-factor on the terminal of hinge region in L chain (Additional file 6: Figure S5B).
Erythropoietin (EPO) is a hormone produced primarily in the kidneys. It has four-helical bundle topology with two long loops; it is bound to the extracellular domain of the EPO receptor. The CD loop located in the region remote from the binding site changed its conformation . Additionally, 15N NMR relaxation data revealed a region (Leu112-Thr132) that has intrinsic flexibility . The proposed method predicted high internal motion on the corresponding loop (Figure 9C). POODLE-S predicted its loop to be disordered region (Additional file 6: Figure S5C).
From results obtained by application of the proposed method to a set of 20 proteins, three or more consecutive residues with predicted score higher than one were extracted. Then, they were regarded as candidates for the conformational change regions. A comparison between the observed conformational change region with predicted high internal motion region revealed at least one overlap between them in 85% of the proteins studied (Table 3, Additional file 7: Table S2 and Additional file 8: Figure S6). If the analysis object was limited to the 16 proteins that undergo conformational change for only one partner, then overlap was observed in 15 proteins (94% of the proteins studied), excepting hirustatin. The internal motion was not predicted for hirustatin because it is a small protein (55 residues). For FlexPred, which predicts conformational switches in proteins, overlaps were observed in only six proteins. Although the definition of the internal motion in this study and conformational switches in FlexPred are similar, the proposed method can detect more observed conformational change regions upon protein-protein interaction than FlexPred can. The proposed method covers all positive results of FlexPred. Both PROFbval and POODLE-S respectively predicted high scores on conformational change regions of 10 and 9 proteins. These observations suggest that the proposed method is sensitive for detection of protein motions related to protein-protein interaction, especially proteins that change conformation for a specific target.
We presented a novel method for predicting internal and external motions in ordered structures based on the RF algorithm using amino acid information alone. The proposed method uses two pieces of information for prediction: the adjacent paired amino acid residues and predicted secondary structure information. The method presents the advantage of prediction using only amino acid sequence information as an input. Consequently, the method is applicable to all sequences. The proposed method exhibited the possibility of detecting protein motion related with protein-protein interaction.
Calculation of internal and external motions
For this study, NMA was performed using FEDER/2 [23, 44]. The NMA was conducted for the energy-minimized conformation using Protein Data Bank (PDB) data as a starting conformation. In an NMA, a mean-square displacement of atom a, in the thermal fluctuations is given as the sum of contributions from individual modes
where D ak is a displacement vector of the atom a in the k-th normal mode and N is the number of dihedral angles used as independent variables.
We consider two conformations for a segment (we considered a segment of nine residues in this study) in each normal mode. One is the minimum-energy conformation, around which the molecule is fluctuating. The position vector of atom a in this minimum-energy conformation is . The other is an instantaneous fluctuating conformation, in which only the k-th normal mode is excited to the root-mean-square thermal amplitude. The position vector of the atom a in this distorted conformation is + D ak . We bring this distorted conformation to the best-fitted position with the minimum-energy conformation purely by translational and rotational motions. The displacement vector of the atom a by this purely translational and rotational motion is designated as ; the residual one is designated as . Then, D ak is decomposed as
Superscripts e and i respectively signify external and internal. The mean square deviation of the atom a is given as
The third term on the right-hand side of this equation was usually found to be much smaller than 1% of the first two terms in our results. Therefore, the mean-square deviation of the atom a is decomposed approximately into external (first term) and internal (second term) ones. The magnitudes of external and internal motions of a segment, and , are defined respectively as averages of and over constituent atoms in the segment. As described herein, we are interested in the main-chain fluctuation. For simplicity, we consider only Cα atoms in this decomposition (meaning that we selected data for the Cα atoms from results obtained using NMA with a full-atom model).
The dataset was created by selecting protein chains from ProMode  as described below. First, proteins with fewer than 100 residues were removed. Then proteins whose root mean square deviation (RMSD) between the energy-minimum structure and PDB structure is more than 2Å were excluded. For proteins with common SCOP id, we selected only one of them . Furthermore, multi-domain proteins defined by SCOP were excluded. Next, some proteins were discarded so that maximum pairwise sequence identity was limited to 25%. The resulting dataset comprised 481 chains (87,236 residues).
We calculated internal and external motions using NMA with a full-atom model for all proteins in the dataset. Decomposed atomic fluctuations to internal and external motions calculated by NMA values, , were normalized respectively to correct for the variation among the proteins in the dataset as
where M denotes the protein length, and and s respectively signify the sample mean and sample standard deviation calculated for each chain. Furthermore, yi is designated as the normalized NMA score of internal or external motions of the i-th residue.
Variables used for encoding sequences
In this study, we created a prediction method that uses paired amino acid information (Figure 10). A protein sequence was encoded using a sliding window whose size is optimized for exhibiting the highest prediction performance (Figure 3; see the Results and Discussion section for details). Internal motion uses an 11-residue window. Therefore, the 920 variables (= 92 paired features × adjacent amino acids, see below for details) are defined for amino acid pairs of the central amino acid with the other 10 amino acids in the window. In contrast, the number of variables is 1,472 for external motion because the window size is 17 residues. The value of variable was given as a sum of the value of feature (defined below) of the central amino acid and paired amino acid. The value of feature was set to either 1 or 0; the value of variable can take a value of 0, 1, or 2. The five and seven residues at both termini were, respectively, excluded for internal and external motion because the value of variable is assigned to the central residue of the window.
Features used for building variables
We defined 18 features, which were divided into four groups designated as physicochem., mobility, ss, and ASA. A value of feature of an amino acid was set to one if the amino acid satisfied a feature's definition, and to zero otherwise.
The first group was derived from physicochemical features (physicochem.) of amino acids defined according to Zvelibel et al. . This group included seven features: hydrophobic, polar, aromatic, aliphatic, small, positive, and negative. For example, FYWH has an aromatic ring. Therefore, the value of the "aromatic" feature of Y is one (Figure 10B). For the variables, 7 × 7 = 49 pairs of these seven features are considered. The second group was derived from protein mobility propensity (mobility). This group included six features: high_mob_high, high_mob_low, nor_mob_high, nor_mob_low, low_mob_high, and low_mob_low. In determining these features, every amino acid in the dataset was classified into either the high, normal, or low mobility group defined according to normalized NMA scores which were, respectively, higher than 1, between -1 and 1, and lower than -1. Then the protein mobility propensity (Prop(n,g)) was defined as
where freq(n,g) respectively represent the relative frequencies of amino acid n in protein mobility group g (= high_mob, normal_mob, and low_mob), g _high = 1 or g _low = 1; otherwise, they are set to zero. For example, because Prop(Y, normal) is higher than one standard deviation from the average of the normal group, the value of the "nor_mob_high" feature for amino acid Y is one (Figure 10B). The meaning is that the protein mobility propensity in the normal group for Y is high. For the variables, 6 × 6 = 36 pairs of these six features are considered.
The third group was associated with the secondary structure (ss), as predicted by PHD or psipred. This group includes three features: helix, sheet, and other. If an amino acid was predicted to be in a helix region, then the value of the "helix" feature of this amino acid was one; the two other features, sheet and other, were zero (Figure 10B). For the variables, it is possible to consider 3 × 3 = 9 pairs of the three features, but we considered only three combinations in this study: helix-helix, sheet-sheet, and other-other.
The fourth group was associated with the accessible surface area (ASA) predicted using RVPnet or sable. This group includes two features: "exposed" and "buried". If an amino acid's predicted ASA value was less than 11, then the value of a "buried" feature was one; it was zero otherwise when using RVPnet. Similarly, if the predicted ASA value was higher than 27, then the value of the "exposed" feature was one (Figure 10B). Two parameters are set so that RVPnet exhibited its highest performance. When using sable, two parameters were set to one and three. For the variables, 2 × 2 = 4 pairs of the two features are considered. In summation, we consider 92 pairs of features (49, 36, 3, and 4 for four groups as described above) for each of 10 adjacent amino acid for internal motion, which results in 92 × 10 = 920 variables (Figure 10A).
Distributions of the normalized NMA score (y i ) are dependent on the secondary structure type. Therefore, in this study, three kinds of models were created: one for each motion based on the idea that the degrees and tendency of mobility in proteins depend on the secondary structure. We therefore defined three categories of window locations according to the location of the central residue: center of a secondary structure (CS), remote area from secondary structures (RS), and periphery of secondary structures (PS) (Figure 11A). Hereinafter, CS means that the central residue was located in a secondary structure and three or more residues distant from either secondary structure terminus. In addition, RS signifies that the central residue was located in the other region, except for secondary structures, and three or more residues separate from both termini of the other region. Also, PS means the central amino acid was not located in either CS or RS. Furthermore, each category was divided into three classes based on the degree of the normalized NMA score: flexible, intermediate, and rigid classes (Figure 11B). A window was assigned to the intermediate class if the normalized NMA score of the central amino acid in the window was within one standard deviation from the mean. Similarly, a window was assigned to the flexible and rigid classes if the normalized NMA scores were, respectively, higher and lower than the intermediate classes.
For this study, a secondary structure was assigned to an amino acid using prediction results from either the secondary structure predictor, PHD or psipred. The amino acids predicted to form α-helix or β-sheet are considered to be located in a secondary structure. Briefly, we divided the amino acids into three categories of window locations, CS, RS, and PS. In each category, they were divided further into three classes, flexible, intermediate, and rigid.
The RF algorithm was used to build a prediction model for classifying amino acids into the three classes of flexible, intermediate, and rigid. In this study, we implemented RF using the R package (randomForest 4.5-22) . For internal and external motions, three RF prediction models were trained respectively for the three categories of window locations, CS, RS, and PS. Three parameters (mtry, ntree, and nodesize) in RF used default values because RF is robust against changes in its parameters. In fact, the prediction accuracies of the case in which the default values were used for the three parameters were almost identical to those of the case in which optimal values were used (Additional file 9: Figure S7). The results of the RF prediction model classified windows into the three classes; their prediction results were attributed to the central residue in the window. Then, the results of classification obtained from RF were converted to a score based on the average and standard deviation of the normalized NMA score in each category (Additional file 10: Table S3). For example, if a window assigned to the flexible class and the Z score of the normalized NMA score is three, then the score of the flexible class in CS of internal motion is 2.509 (= -0.248 + 0.919 × 3), and its value is given to the amino acid assigned to the flexible class in CS. The score reflects the degrees of protein motion. The score was smoothed using a five-residue moving average. Two models were used for the external motion because the length distribution of the high mobility region with NMA normalized score > 1 indicated a peak near the nine-residue-long segments (see the Results and Discussion section: Figure 2). One model (external_short) used the short flexible region (≤ 9 residue length) as the dataset of the flexible class; the other model (external_long) used a longer flexible region in the flexible class. The final prediction result of external motion combined the results of both models if the external_long model predicted a long region (> 9 residues) in the flexible class.
The prediction results were assessed on a residue basis: the predicted score in the sequence was compared to the normalized NMA score. The predicted scores were allowed to have a margin. In this study, we set the margin value of ± 0.2 because the minimum difference of the average of normalized NMA score between adjacent classes was 0.225 (Additional file 10: Table S3).
In this work, the error function used the mean absolute error (MAE), which was defined as the absolute difference between two values. Actually, MAE was calculated as
where x i and y i respectively represented the score obtained from the proposed method of i-th residue and the normalized NMA score; M denotes the protein length. The MAE value approaches 0 as the prediction improves.
Furthermore, to assess the method's performance, we calculated the correlation coefficient (CC) between two datasets as
In fact, CC ranges from -1 to 1; a large positive value represents a positive correlation. In this study, it means that the patterns of the normalized NMA score and the predicted score are very similar. We also used a Receiver Operating Characteristic (ROC) curve as another assessment criterion by classifying the prediction results into flexible and rigid classes. The amino acid was assigned to the flexible class when the normalized NMA score was higher than the threshold value. The ROC curve was obtained by plotting the false positive rate (R FT ) against the true positive rate (R TP ). The R TP is defined as the percentage of windows of the flexible class correctly predicted as flexible class over all positives (sum of true positives and false positives). Similarly, the R FP is defined as the percentage of windows of the rigid class incorrectly predicted as flexible class over all negatives (sum of true negatives and false negatives). The R FP against R TP was shown, while the score increased from -2 to 4 with a 0.01 increment. The amino acid is considered to be predicted as a flexible class if a predicted score is higher than the score. The larger the area under the ROC curve (AUC), the more robust an algorithm is. An area of 1.00 is considered a perfect predictor.
Kirsch JF, Eichele G, Ford GC, Vincent MG, Jansonius JN, Gehring H, Christen P: Mechanism of action of aspartate aminotransferase proposed on the basis of its spatial structure. J Mol Biol 1984, 174(3):497–525. 10.1016/0022-2836(84)90333-4
Faber HR, Matthews BW: A mutant T4 lysozyme displays five different crystal conformations. Nature 1990, 348(6298):263–6. 10.1038/348263a0
Sampson NS, Knowles JR: Segmental movement: definition of the structural requirements for loop closure in catalysis by triosephosphate isomerase. Biochemistry 1992, 31(36):8482–7. 10.1021/bi00151a014
Zhang M, Tanaka T, Ikura M: Calcium-induced conformational transition revealed by the solution structure of apo calmodulin. Nat Struct Biol 1995, 2(9):758–67. 10.1038/nsb0995-758
Nishikawa T, Go N: Normal modes of vibration in bovine pancreatic trypsin inhibitor and its mechanical property. Proteins 1987, 2(4):308–29. 10.1002/prot.340020407
Ishida H, Jochi Y, Kidera A: Dynamic structure of subtilisin-eglin c complex studied by normal mode analysis. Proteins 1998, 32(3):324–33. 10.1002/(SICI)1097-0134(19980815)32:3<324::AID-PROT8>3.0.CO;2-H
Williams RJ: NMR studies of mobility within protein structure. Eur J Biochem 1989, 183(3):479–97. 10.1111/j.1432-1033.1989.tb21076.x
Chi Y, Kumar TK, Chiu IM, Yu C: 15N NMR relaxation studies of free and ligand-bound human acidic fibroblast growth factor. J Biol Chem 2000, 275(50):39444–50. 10.1074/jbc.M007205200
Chill JH, Quadt SR, Anglister J: NMR backbone dynamics of the human type I interferon binding subunit, a representative cytokine receptor. Biochemistry 2004, 43(31):10127–37. 10.1021/bi049606g
Gitti RK, Wright NT, Margolis JW, Varney KM, Weber DJ, Margolis FL: Backbone dynamics of the olfactory marker protein as studied by 15N NMR relaxation measurements. Biochemistry 2005, 44(28):9673–9. 10.1021/bi050149t
Lipari G, Szabo A: Model-free approach to the interpretation of nuclear magnetic resonance relaxation in macromolecules. 1. Theory and range of validity. J Am Chem Soc 1982, 104: 4546–59. 10.1021/ja00381a009
Lipari G, Szabo A: Model-free approach to the interpretation of nuclear magnetic resonance relaxation in macromolecules. 2. Analysis of experimental results. J Am Chem Soc 1982, 104: 4559–70. 10.1021/ja00381a010
Shatsky M, Nussinov R, Wolfson HJ: Flexible protein alignment and hinge detection. Proteins 2002, 48(2):242–56. 10.1002/prot.10100
Ye Y, Godzik A: Database searching by flexible protein structure alignment. Protein Sci 2004, 13(7):1841–50. 10.1110/ps.03602304
Qi G, Lee R, Hayward S: A comprehensive and non-redundant database of protein domain movements. Bioinformatics 2005, 21(12):2832–8. 10.1093/bioinformatics/bti420
Kovacs JA, Chacon P, Abagyan R: Predictions of protein flexibility: first-order measures. Proteins 2004, 56(4):661–8. 10.1002/prot.20151
Wells S, Menor S, Hespenheide B, Thorpe MF: Constrained geometric simulation of diffusive motion in proteins. Phys Biol 2005, 2(4):S127–36. 10.1088/1478-3975/2/4/S07
Go N, Noguti T, Nishikawa T: Dynamics of a small globular protein in terms of low-frequency vibrational modes. Proc Natl Acad Sci USA 1983, 80(12):3696–700. 10.1073/pnas.80.12.3696
Levitt M, Sander C, Stern PS: Protein normal-mode dynamics: trypsin inhibitor, crambin, ribonuclease and lysozyme. J Mol Biol 1985, 181(3):423–47. 10.1016/0022-2836(85)90230-X
Tama F, Sanejouand YH: Conformational change of proteins arising from normal mode calculations. Protein Eng 2001, 14: 1–6. 10.1093/protein/14.1.1
Li G, Cui Q: A coarse-grained normal mode approach for macromolecules: an efficient implementation and application to Ca(2+)-ATPase. Biophys J 2002, 83(5):2457–74. 10.1016/S0006-3495(02)75257-0
Yang LW, Liu X, Jursa CJ, Holliman M, Rader AJ, Karimi HA, Baha r I: iGNM: a database of protein functional motions based on Gaussian Network Model. Bioinformatics 2005, 21(13):2978–87. 10.1093/bioinformatics/bti469
Wako H, Kato M, Endo S: ProMode: a database of normal mode analyses on protein molecules with a full-atom model. Bioinformatics 2004, 20(13):2035–43. 10.1093/bioinformatics/bth197
Echols N, Milburn D, Gerstein M: MolMovDB: analysis and visualization of conformational change and structural flexibility. Nucleic Acids Res 2003, 31: 478–82. 10.1093/nar/gkg104
Lee RA, Razaz M, Hayward S: The DynDom database of protein domain motions. Bioinformatics 2003, 19(10):1290–1. 10.1093/bioinformatics/btg137
Flores SC, Gerstein MB: FlexOracle: predicting flexible hinges by identification of stable domains. BMC Bioinformatics 2007, 8: 215. 10.1186/1471-2105-8-215
Emekli U, Schneidman-Duhovny D, Wolfson HJ, Nussinov R, Haliloglu T: HingeProt: automated prediction of hinges in protein structures. Proteins 2008, 70(4):1219–27. 10.1002/prot.21613
Garzon JI, Kovacs J, Abagyan R, Chacon P: DFprot: a webtool for predicting local chain deformability. Bioinformatics 2007, 23(7):901–2. 10.1093/bioinformatics/btm014
Young M, Kirshenbaum K, Dill KA, Highsmith S: Predicting conformational switches in proteins. Protein Sci 1999, 8(9):1752–64. 10.1110/ps.8.9.1752
Boden M, Bailey TL: Identifying sequence regions undergoing conformational change via predicted continuum secondary structure. Bioinformatics 2006, 22(15):1809–14. 10.1093/bioinformatics/btl198
Kuznetsov IB: Ordered conformational change in the protein backbone: prediction of conformationally variable positions from sequence and low-resolution structural data. Proteins 2008, 72: 74–87. 10.1002/prot.21899
Kuznetsov IB, McDuffie M: FlexPred: a web-server for predicting residue positions involved in conformational switches in proteins. Bioinformation 2008, 3(3):134–6.
Gu J, Gribskov M, Bourne PE: Wiggle-predicting functionally flexible regions from primary sequence. PLoS Comput Biol 2006, 2(7):e90.. 10.1371/journal.pcbi.0020090
Breinman L: Random Forests. Machine Learning 2001, 45: 5–32. 10.1023/A:1010933404324
Rost B: PHD: predicting one-dimensional protein structure by profile-based neural networks. Methods Enzymol 1996, 266: 525–39. full_text
Ahmad S, Gromiha MM, Sarai A: Real value prediction of solvent accessibility from amino acid sequence. Proteins 2003, 50(4):629–35. 10.1002/prot.10328
Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999, 292(2):195–202. 10.1006/jmbi.1999.3091
Adamczak R, Porollo A, Meller J: Combining prediction of secondary structure and solvent accessibility in proteins. Proteins 2005, 59(3):467–75. 10.1002/prot.20441
Schlessinger A, Yachdav G, Rost B: PROFbval: predict flexible and rigid residues in proteins. Bioinformatics 2006, 22(7):891–3. 10.1093/bioinformatics/btl032
Shimizu K, Hirose S, Noguchi T: POODLE-S: web application for predicting protein disorder by using physicochemical features and reduced amino acid set of a position-specific scoring matrix. Bioinformatics 2007, 23(17):2337–8. 10.1093/bioinformatics/btm330
Dobbins SE, Lesk VI, Sternberg MJ: Insights into protein flexibility: The relationship between normal modes and conformational change upon protein-protein docking. Proc Natl Acad Sci USA 2008, 105(30):10390–5. 10.1073/pnas.0802496105
Shin DH, Song HK, Seong IS, Lee CS, Chung CH, Suh SW: Crystal structure analyses of uncomplexed ecotin in two crystal forms: implications for its function and stability. Protein Sci 1996, 5(11):2236–47. 10.1002/pro.5560051110
Cheetham JC, Smith DM, Aoki KH, Stevenson JL, Hoeffel TJ, Syed RS, Egrie J, Harvey TS: NMR structure of human erythropoietin and a comparison with its receptor bound conformation. Nat Struct Biol 1998, 5(10):861–6. 10.1038/2302
Wako H, Endo S, Nagayama K, Go N: FEDER/2: program for static and dynamic conformational energy analysis of macro-molecules in dihedral angle space. Comp Phys Comm 1995, 91: 233–51. 10.1016/0010-4655(95)00050-P
Hubbard TJ, Ailey B, Brenner SE, Murzin AG, Chothia C: SCOP, Structural Classification of Proteins database: applications to evaluation of the effectiveness of sequence alignment methods and statistics of protein structural data. Acta Crystallogr D Biol Crystallogr 1998, 54:(Pt 6 Pt 1):1147–54. 10.1107/S0907444998009172
Zvelebil MJ, Barton GJ, Taylor WR, Sternberg MJ: Prediction of protein secondary structure and active sites using the alignment of homologous sequences. J Mol Biol 1987, 195(4):957–61. 10.1016/0022-2836(87)90501-8
Liaw A, Wiener M: Classification and Regression by randomForest. R News 2002, 2: 18–22.
We thank members of molecular function team in Computational Biology Research Center (CBRC), of Kuroda's group at TUAT, and Dr. Hiroyuki Toh (Kyushu University) for helpful discussion. This work was supported by PharmaDesign, Inc..
SH designed the prediction algorithm, performed the evaluation experiments, analyzed protein-protein interactions, and wrote the manuscript. KY provided the guidance for developing and evaluating the algorithm. YK provided a critical review and edited the manuscript. HW and SE executed NMA to calculate internal and external motion. SK provided the guidance for developing the algorithm. TN provided the guidance for evaluating experiments and analyzing protein-protein interaction. All authors contributed to the research, and mutually discussed the results and manuscript, and approved the manuscript.
Electronic supplementary material
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
About this article
Cite this article
Hirose, S., Yokota, K., Kuroda, Y. et al. Prediction of protein motions from amino acid sequence and its application to protein-protein interaction. BMC Struct Biol 10, 20 (2010). https://doi.org/10.1186/1472-6807-10-20
- Random Forest
- Protein Motion
- Internal Motion
- Accessible Surface Area
- Mean Absolute Error