How to quantify the information exchange within a protein?
In Shannon's theory of information transmission along noisy channels, where the level of noise corresponds to the fraction of input characters that may be wrongly translated to output characters, information is quantified as the degree of certainty obtainable about the output signal once a particular input is given [15, 17]. A random channel, which produces a random output for any input signal, has a maximum degree of uncertainty, whereas for a noiseless channel, which produces an exact copy of the input at the output, there is no uncertainty about the output. This variation in the degree of uncertainty is expressed in terms of entropy, which allows information-theoretical principles to be applied directly to protein dynamics simulations [16]. Specifically, we focus here on the notion of mutual information to express the mutual dependence between the conformational states of residues in a protein, as a quantification of the information flow within it. For more details on information theory we refer to [17]. As a tool, mutual information, like other co-variation measures, has been applied to identify co-varying or co-evolving residues in multiple sequence alignments [31, 32]. Yet, as far as we are aware, no work has applied this information-theoretical measure to determine the conformational coupling between residue sidechains.
So when do two sidechains exchange information? Two linked residues are dependent when knowing one conformation will convey information about the conformation of the other. In this case, the two residues share information. For instance, consider two amino acid residues in a protein, each having some conformational variation in their side chains. The mutual information shared by these residues is defined by the entropy reduction that is observed at the second residue when the conformation of the first is fixed in an arbitrary configuration (and vice versa). Thus, when ligand binding conformationally restricts the first residue then mutual information will quantify how much of the signal of ligand binding is received at the second residue since its conformational flexibility will also be restricted. Mutual information between two residues is thus zero when the conformational state of one residue does not provide any information about the conformational state of the other residue. On the other hand, mutual information is maximal when each conformational state of a residue's sidechain uniquely defines the conformational state of the other residue's sidechain. In accordance with information theory, the conformational space of each residue corresponds to the residue's alphabet.
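The two limiting cases described above can be illustrated with a minimal sketch that estimates mutual information from paired observations of two discrete conformational states, using the standard identity I(i;j) = H(i) + H(j) - H(i,j). The function names and the toy state labels are illustrative assumptions and are not taken from the Methods.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of the empirical distribution in a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)

def mutual_information(states_i, states_j):
    """Estimate I(i;j) = H(i) + H(j) - H(i,j) from paired samples of two residues."""
    if len(states_i) != len(states_j):
        raise ValueError("the two state sequences must be paired sample by sample")
    h_i = entropy(Counter(states_i))
    h_j = entropy(Counter(states_j))
    h_ij = entropy(Counter(zip(states_i, states_j)))
    return h_i + h_j - h_ij

# Toy data (made up): residue j is independent of residue i ...
independent = mutual_information(["a", "a", "b", "b"], ["x", "y", "x", "y"])
# ... or each state of residue i uniquely defines the state of residue j.
coupled = mutual_information(["a", "a", "b", "b"], ["x", "x", "y", "y"])
print(independent, coupled)
```

In the independent case the estimate is zero, while in the fully coupled case it equals the entropy of either residue (one bit for two equally likely states).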
Exploration of the conformational space of the amino acids
The calculation of mutual information between any two amino acids in a protein requires reliable statistics of the conformational dynamics of all amino acids in the native structure of the protein (see Methods for details). The method relies on computing both the probability that residue i adopts an arbitrary conformation independently and the joint probability of finding residues i and j in an arbitrary combination of conformations. This is achieved by effectively sampling the conformational states of each amino acid in a protein structure so that all possible combinations of conformations of residues i and j can be explored. The computational size of such a sampling problem can only be addressed by reducing the number of possible conformations that each amino acid can adopt to a relatively small number of discrete states. To achieve this, we treat the sidechain and backbone flexibility of each amino acid as separate components of its overall conformational dynamics and consider only a finite number of states for each.
In order to keep the number of sidechain conformations computationally tractable, it is common to employ a discrete, finite-size alphabet for each residue, called a rotamer library. Such rotamer libraries are constructed by extracting the most frequently occurring states from a database of high-resolution protein structures, and thus the degree of coarse graining that is employed to record the statistics imposes a resolution on the data. For the purpose of detecting conformational dependencies between neighboring residues in a protein structure, we require a finer resolution than is provided by the common rotamer libraries used for homology modeling. We therefore constructed a database of backbone-dependent sidechain conformations with a 10 degree resolution on the sidechain dihedral (Chi-) angles (see Methods). Given the backbone dihedral angles of a certain residue in the protein structure, a list of possible sidechain conformations and their probabilities can be retrieved from this database. The number of chi angles in the residue determines the size of this list, meaning that residues with small sidechains tend to have shorter lists than residues with long sidechains. The conformational state of the sidechain in the protein structure serves as a starting point for generating these lists.
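As a rough illustration of what a 10 degree resolution on the chi angles means, the sketch below maps each dihedral angle to one of 36 bins and represents a sidechain conformation as a tuple of bin indices. The bin boundaries (starting at 0 degrees) and the function names are assumptions made for this example; the actual database construction is described in the Methods.

```python
def chi_bin(angle_deg, resolution=10.0):
    """Map a dihedral angle in degrees to a bin index in [0, 360 / resolution)."""
    n_bins = int(360 / resolution)
    wrapped = angle_deg % 360.0          # bring the angle into [0, 360]
    # The final modulo guards against rounding that yields exactly 360.0.
    return int(wrapped // resolution) % n_bins

def sidechain_state(chi_angles, resolution=10.0):
    """Discrete state of a sidechain: one bin index per chi angle."""
    return tuple(chi_bin(a, resolution) for a in chi_angles)

# A residue with two chi angles has up to 36 * 36 discrete states,
# one with four chi angles up to 36 ** 4, hence the longer lists.
state = sidechain_state([-65.0, 175.0])
print(state)
```

Residues with more chi angles have a larger alphabet under this discretization, which is consistent with the longer conformation lists for long sidechains.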
Information concerning the flexibility of the backbone of a protein can be obtained from either molecular dynamics (MD) simulations or Nuclear Magnetic Resonance (NMR) data. Exploring the conformational space accessible to the protein backbone using MD is computationally expensive. Moreover, Vendruscolo and co-workers have recently shown that MD simulations yield more realistic results when restricted by experimental data in the form of residue-residue distances derived from Nuclear Overhauser Effects (NOEs) in an NMR experiment [33, 34], suggesting that unrestricted MD simulations are not the most effective method to sample the backbone ensemble. The work of Vendruscolo and others also revealed that the backbone dynamics can essentially be captured by a small number of representative backbone structures. Furthermore, Bahar and colleagues recently showed that NMR models can also be viewed as an ensemble of conformations accessible under physiological conditions and that, even though the RMSD values reflect the uncertainties in the coordinates, they also contain physically meaningful contributions of equilibrium fluctuations [35]. Although this point is still under debate, we found that sampling the sidechain conformations of each amino acid on the collection of backbones present in the NMR dataset using a Monte Carlo approach (see Methods) yields results that are consistent with experimental studies as well as with the more exhaustive simulations performed by Kuriyan et al. [9]. More specifically, we employed a Metropolis algorithm implemented in the FoldX force field [36, 37], from which an equilibrium distribution of sidechain conformations compatible with a given protein backbone structure is obtained (see Methods for details). The force field scores the conformations, taking into account the packing interactions.
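The general idea of Metropolis sampling over discrete sidechain states can be sketched as follows. This is not the FoldX implementation: the energy table is a made-up stand-in for the force field score, and the temperature factor kT, the uniform proposal and the function names are assumptions chosen for illustration only.

```python
import math
import random

def metropolis_sample(energy, n_states, n_steps, kT=0.6, seed=0):
    """Sample discrete states with the Metropolis criterion.

    energy: callable returning the (hypothetical) energy of a state index.
    Proposals are drawn uniformly, so the proposal distribution is symmetric.
    """
    rng = random.Random(seed)
    state = rng.randrange(n_states)
    samples = []
    for _ in range(n_steps):
        proposal = rng.randrange(n_states)
        delta = energy(proposal) - energy(state)
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            state = proposal
        samples.append(state)
    return samples

# Toy energies for three rotamer states (arbitrary units, not FoldX output).
TOY_ENERGY = [0.0, 1.0, 3.0]
samples = metropolis_sample(lambda s: TOY_ENERGY[s], n_states=3, n_steps=20000)
fractions = [samples.count(s) / len(samples) for s in range(3)]
print(fractions)
```

After enough steps the visit frequencies approach the Boltzmann weights of the toy energies, which is the sense in which such a sampler yields an equilibrium distribution of conformations.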
The sidechain sampling results for a single backbone structure introduce a strong bias in the apparent information flux towards residues whose sidechains happen to be strongly coupled in this particular backbone conformation. When information is obtained from an ensemble of related backbone conformations, however, each backbone introduces slight variations in the pattern of residue-residue couplings. As a result, combining the sidechain sampling over the entire ensemble of backbone structures acts as a filter that removes sporadic couplings while accumulating consistent couplings, thereby revealing the true network of information exchange between all residues. We have here taken the entire ensemble of backbone structures in the NMR datasets of the Fyn kinase SH2 domain as an adequate sample and have not systematically explored whether a reduced number of backbones could be employed to the same effect.
Construction of the residue-residue information network in Fyn SH2
Since the Monte Carlo sampling directly yields the equilibrium distribution of sidechain conformations observed at each position along the protein backbone, the quantity of mutual information between all residue pairs in the protein can be calculated using probability and entropy calculations (see Methods and [17]). In this way, information transfer over both short and long distances in the folded protein is elucidated, explicitly revealing the communication between all residues in the protein. Extracting only the changes in mutual information that result from the difference between the bound and unbound states of the SH2 domain (see Methods) highlights the change in communication patterns. Hence, those amino acids that experience the strongest changes in mutual information can be detected and mapped onto the network of sidechain interactions in the protein, thereby revealing how series of dynamically coupled residues in the topology of the protein direct the overall communication.
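The change in mutual information upon binding can be sketched as a pairwise difference between two mutual information tables. The representation as dictionaries keyed by residue pairs, the residue labels and all numerical values below are hypothetical and serve only to show the bookkeeping.

```python
def delta_mutual_information(mi_bound, mi_unbound):
    """Return the change in mutual information (bound minus unbound) per residue pair."""
    if set(mi_bound) != set(mi_unbound):
        raise ValueError("both states must cover the same residue pairs")
    return {pair: mi_bound[pair] - mi_unbound[pair] for pair in mi_bound}

# Made-up values in bits for three generic residues A, B and C.
mi_bound = {("A", "B"): 1.5, ("A", "C"): 0.5, ("B", "C"): 0.25}
mi_unbound = {("A", "B"): 0.25, ("A", "C"): 0.75, ("B", "C"): 0.25}

delta = delta_mutual_information(mi_bound, mi_unbound)
print(delta)
```

A positive entry indicates a pair that becomes more strongly coupled in the bound state, a negative entry a pair that uncouples.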
In order to obtain the network of residue-residue couplings in the Fyn SH2 domain, we employed FoldX-based sidechain sampling on the backbone structures of this SH2 domain as determined by NMR on the protein domain in isolation and bound to its phosphopeptide ligand (PDB identifiers 1AOU and 1AOT [38], see Methods). The NMR ensemble was determined using over 90% of the structurally non-redundant nuclear Overhauser effects (NOEs), so that the ensemble can be considered to possess adequate precision and accuracy [39–41]. In order to ensure reliable statistics, we collected over 551 thousand samples of sidechain configurations from approximately 275 million simulation steps over the 22 backbone models present in the NMR ensemble, producing the probabilities of finding each residue's sidechain in a particular conformation. From these probabilities the mutual information between all residue pairs was derived.
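As a back-of-the-envelope check on these sampling figures, the short calculation below derives the approximate number of simulation steps per recorded sample and the approximate number of samples per backbone model. It uses the rounded numbers quoted above and assumes, which the text does not state, that the samples are spread evenly over the 22 backbones.

```python
n_samples = 551_000        # "over 551 thousand" recorded sidechain configurations
n_steps = 275_000_000      # "approximately 275 million" simulation steps
n_backbones = 22           # backbone models in the NMR ensemble

steps_per_sample = n_steps / n_samples
samples_per_backbone = n_samples / n_backbones   # assumes an even split

print(f"about {steps_per_sample:.0f} steps per recorded sample")
print(f"about {samples_per_backbone:.0f} samples per backbone model")
```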
To capture how the information exchange within the SH2 domain changes due to ligand binding, we require NMR structural data on both the bound and the unbound state of the protein. Since no structural data on the unbound state of Fyn SH2 are publicly available, we assume here that the backbone flexibility for the bound and unbound states can be derived from the ensemble stored in the 1AOU dataset by energy minimization (see Methods). As a result of this simplification, large conformational changes in the backbone of the SH2 domain are not taken into account. Yet since domains do not normally experience large structural changes, this simplification may still provide viable results.
Changes in mutual information link residues at long distances
Figure 2A shows for each residue pair the calculated change in mutual information upon phosphopeptide binding. As the figure shows, most residues do not experience a large change in mutual information (light blue regions in the matrix). This is also visible in Figure 2B, which shows the distribution of the information change. Most changes in the information exchange between residue pairs fall in the interval [-0.5, 0.5] bits. Since this set is barely affected by binding, we will refer to this collection of residues as the silent group (for instance the serine at position 23, referred to as Ser23, near the end of the αA-helix in Figure 1B and 1D). All values within this interval are considered noise, meaning that a clear signal is difficult to obtain when the change falls in this range. Clearly, as a result of binding, all residues experience some (small) change. Yet those residues that are more strongly affected are considered more relevant for understanding the information exchange inside the structure. As can be observed, far fewer residues experience strong effects, and these correspond either to a (drastic) increase in coupling (green to red colors) or to a (drastic) decrease in coupling (dark blue). These residues will be called the informative group. For instance, HisβD4 at position 60 (a residue in the phosphotyrosine binding cavity, see Figure 1C) is one residue in the informative group whose coupling with all other residues increases significantly upon peptide binding. We also observe that, in the case of the SH2 domain, only a few residues, notably Tyr89 (see Figure 1C), seem to experience uncoupling as a result of phosphotyrosine binding. Even though little uncoupling is observed in Fyn SH2, this does not mean that significant uncoupling cannot occur in other protein models, especially in the event of larger, binding-induced allosteric changes. The mutual information calculations introduced here can thus, in principle, capture very different mechanistic scenarios.
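The distinction between silent and informative residues can be sketched with a simple threshold on the change in mutual information. The rule used here, that a residue is informative as soon as one of its pairwise changes falls outside the [-0.5, 0.5] bit interval, is an assumption made for illustration; the residue labels and values are made up.

```python
NOISE_BITS = 0.5  # half-width of the noise interval quoted in the text

def split_residues(delta, noise=NOISE_BITS):
    """Split residues into (informative, silent) sets from pairwise changes in bits."""
    residues = {r for pair in delta for r in pair}
    informative = {r for pair, change in delta.items() if abs(change) > noise for r in pair}
    return informative, residues - informative

# Toy changes in mutual information for generic residues A to E.
delta = {
    ("A", "B"): 1.4,    # strong increase in coupling
    ("A", "C"): 0.2,    # within the noise interval
    ("B", "C"): -0.3,   # within the noise interval
    ("C", "D"): 0.1,    # within the noise interval
    ("D", "E"): -0.9,   # strong decrease in coupling (uncoupling)
}
informative, silent = split_residues(delta)
print(sorted(informative), sorted(silent))
```

Using the absolute value treats strong increases and strong decreases alike, so both coupling and uncoupling scenarios end up in the informative group.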
To visualise the mechanistic implications for our model system, a clustering algorithm (see Methods) was applied to the mutual information matrix in Figure 2A and the result was mapped onto the structure of Fyn SH2. Figure 2C displays the residue cluster for which the mutual information increases most upon peptide binding (≥1.1 bits). Only those pairs of residues whose mutual dependence is above this threshold are accepted in the cluster, meaning that they form a complete graph in which the weight of each link is above the threshold value. As expected, the cluster contains residues that interact directly with the phosphopeptide (ThrEF1, ThrEF2, HisβD4 and TyrβD5), but interestingly it also contains residues on the other side of the domain, from which the SH2-SH3 and SH2-kinase linkers emerge (Pro27 and Pro103 according to PDB 1AOT). Even though previous studies of the Fyn SH2 domain [38] did not assign an important role to the conserved HisβD4, our analysis suggests that this residue experiences a very strong change in mutual information with all other residues in the structure upon ligand binding (Figure 2B, yellow to red colors). This strong effect may be the result of the displacement of helix αA in this particular SH2 domain, which makes the pTyr-binding cavity narrower when the peptide is bound [38]. The most affected residue appears to be ThrEF1.
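The complete-graph criterion described above can be sketched by thresholding the pairwise changes and enumerating maximal cliques. The clustering algorithm actually used is given in the Methods; here a plain Bron-Kerbosch enumeration is swapped in as a stand-in, and the residue labels and values are made up.

```python
def threshold_graph(delta, threshold):
    """Adjacency sets linking residue pairs whose change is at least the threshold."""
    adjacency = {}
    for (a, b), change in delta.items():
        adjacency.setdefault(a, set())
        adjacency.setdefault(b, set())
        if change >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

def maximal_cliques(adjacency):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adjacency[v], excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adjacency), set())
    return cliques

# Toy changes in mutual information (bits) between generic residues A to D.
delta = {("A", "B"): 1.2, ("A", "C"): 1.3, ("B", "C"): 1.15,
         ("C", "D"): 0.7, ("A", "D"): 0.2, ("B", "D"): 0.1}
cliques = maximal_cliques(threshold_graph(delta, threshold=1.1))
largest = max(cliques, key=len)
print(sorted(largest))
```

Every pair inside the reported cluster exceeds the threshold, which is what distinguishes this criterion from a looser connected-component grouping.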
Thus, our analysis reveals a strong coupling between the peptide binding site of Fyn SH2 and its SH3-SH2 and SH2-kinase linkers. These first results show that the sampling approach discussed here provides a direct way to identify and quantify highly coupled groups of residues. Moreover, they confirm the notion that subtle changes in structural dynamics can effectively couple residues at distal locations in the structure. Note that the clustering does not show how this information exchange actually occurs, i.e. it does not provide a causal explanation; it only identifies the residues that may be involved in signaling. Yet the idea is that lowering the threshold yields a coherent collection of residues, possibly identifying a contiguous path that links the binding region to other parts of the domain structure.
Closer inspection of the information exchange for some residues
To understand the key elements of information exchange in proteins, we compared the change in mutual information of both silent and informative residues with the rest of the protein. We illustrate our analysis here with two examples that recapitulate the key features of informative and silent residues. Ser23 serves as an instance of the silent residues, whereas HisβD4 is a prominent member of the informative residues. Specifically, we mapped the change in mutual information of these residues with all the other residues of the SH2 domain onto the structure. Strong changes (ΔI > 1.0) are colored red, weak changes (ΔI < 0.3) blue and intermediate changes white. In Figure 3A, we observe for HisβD4 that three classes of residues contribute significantly to its score. First, locally interacting residues directly influence each other (e.g. residue Val58). Second, the phosphorylated peptide connects all the relevant residues in the binding pocket, creating a channel through which conformational fluctuations can be transmitted (e.g. residue ThrEF1). As a consequence, these residues are mutually dependent on the state of HisβD4. Third, HisβD4 is also coupled to a number of residues on the other side of the protein. We see in Figure 3A (right) that the residues at the kinase linker (carboxy-terminal end of the protein) that were previously detected by clustering are again present, i.e. long-range communication between the histidine and residues Val101, Pro103 and Val86 is clearly visible here. Moreover, we also see a coupling with a number of residues related to the SH3-SH2 linker region (Trp at position 7 and Tyr at position 8 in the 1AOT structure), meaning that information is exchanged with both linker regions when switching between the unbound and bound states. The results for Ser23 are strikingly different. In Figure 3B, we see that when the phosphorylated peptide binds to the SH2 domain, only a few residues located in the peptide binding site exchange more information with this particular residue. Furthermore, informative and silent residues differ strongly not only in the number of residues with which they are coupled but also in the distances over which they communicate. Figure 4 shows that whereas Ser23 is conformationally isolated, HisβD4 has an extensive network of couplings with both proximal and distal residues.
Contiguous communication pathway in Fyn SH2
As argued in previous sections, our information theoretical approach provides a method to quantify the change in exchange of mutual information between all pairs of residues of the Fyn SH2 domain as a consequence of ligand binding. Our analysis revealed that for the majority of residues the conformational coupling to the rest of the protein is not altered upon peptide binding and as a consequence they can be considered silent in terms of signal transduction. A small fraction of residues, however, experience a significant change in conformational coupling and are thus information-rich. Clustering only the most informative of these residues (ΔI ≥ 1.1 bits) revealed a strong coupling between the peptide binding site and the region harboring the linkers connecting the Fyn SH2 to the other domains of the Fyn kinase. It remains of course to be explained by which structural mechanism these distal sites become coupled. As argued earlier, the current framework can only identify and quantify the residues that are involved in the interaction. To identify the causal relationships a different analysis is required.
Several mechanisms for signal transduction have been described in the literature [30], and our method as such does not provide the means to unequivocally single out a given mechanism. However, clustering the informative residues and mapping these onto the structure of the protein can provide substantial information. For instance, the relative occurrence of coupling versus uncoupling pairs and their disposition on the topology of a protein domain could make it possible to distinguish pathway models [7, 30] from allosteric models [42].
Since our model displays only an increase in conformational coupling upon ligand binding, which suggests a pathway model, we extract the group of residues that define the pathway by clustering (see Methods) the informative residues whose mutual dependence is higher than the noise level (0.5 bits). In Figure 5(A–D) we see how the pathway is formed when the clustering threshold is decreased from 1.1 to 0.5 bits: a contiguous dynamic pathway [30] emerges, involving the previously identified residues in the peptide binding site and the linker region.
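How a contiguous set of residues can grow as the threshold is lowered is sketched below. For simplicity the sketch follows links from a seed residue with a breadth-first search instead of applying the clustering from the Methods, so it is a simplification and not the procedure behind Figure 5. The residue labels, the coupling values and the intermediate threshold of 0.7 bits are made up.

```python
from collections import deque

def reachable(delta, seed, threshold):
    """Residues connected to the seed through links of at least the threshold."""
    adjacency = {}
    for (a, b), change in delta.items():
        if change >= threshold:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
    seen = {seed}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

# Toy chain from a binding-site residue to a linker residue (values in bits).
delta = {("site", "core1"): 1.2, ("core1", "core2"): 0.8,
         ("core2", "linker"): 0.6, ("site", "other"): 0.2}

paths = {t: reachable(delta, "site", t) for t in (1.1, 0.7, 0.5)}
for t, members in paths.items():
    print(t, sorted(members))
```

In this toy example the set grows from the binding site towards the linker as the threshold drops to the noise level, while the weakly coupled residue never joins.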
As discussed above, only a subset of the residues of the SH2 domain is involved in signal transduction, whereas the other residues seem to be conformationally isolated from ligand binding. Importantly, signal transduction is achieved throughout the core of the SH2 domain and involves the main secondary structure elements of the structure. Hence, we show that the Fyn SH2 domain acts as an indivisible information transmission unit, propagating information from its binding site across the tertiary structure to the linker regions of the domain.