The observation of evolutionary interaction pattern pairs in membrane proteins

Background Over the last two decades, many bioinformatics approaches have been developed that address one of the most promising, yet unsolved problems in modern life sciences: the prediction of structural features of a protein. Applied to transmembrane proteins, such approaches provide valuable knowledge about their three-dimensional structure. For this reason, the analysis of membrane proteins is essential in genome-wide and proteome-wide investigations, and many in-silico approaches have been used extensively to advance the understanding of membrane protein structures and functions.
Results Amino acid covariation within interacting sequence parts, extracted from an evolutionary sequence record of α-helical membrane proteins, can be used for structure prediction. In a recent study we discussed the significance of short membrane sequence motifs that are widely present in nature and act as stabilizing 'building blocks' during protein folding and in retaining the three-dimensional fold. In this work, we used motif data to define evolutionary interaction pattern pairs. These were obtained from different pattern alignments and were used to evaluate which coupling mechanisms evolution provides. We show that short interaction patterns of homologous sequence records are membrane protein family-specific signatures. These signatures can provide valuable information for structure prediction and protein classification. The results are in good agreement with recent studies.
Conclusions We show how evolution realizes covariation within discriminative interaction patterns to maintain structure and function. This points to their general importance for α-helical membrane protein structure formation and interaction mediation. The energy-based approaches of previously published works are not considered here. The low-cost, rapid computational methods proposed in this work provide valuable information to classify unknown α-helical transmembrane proteins and to determine their structural similarity.
Electronic supplementary material The online version of this article (doi:10.1186/s12900-015-0033-5) contains supplementary material, which is available to authorized users.

genomics [1]. Generally, membrane proteins are poorly soluble and cover a wide intracellular concentration range. The limited applicability of many proteomics methods keeps membrane protein analysis an experimentally challenging field [2]. Hence, the number of known three-dimensional structures is relatively small, with 437 non-redundant membrane protein chains currently available [3][4][5]. Consequently, approaches are needed that predict structural and functional features of unknown membrane proteins. A variety of methods have been developed to predict structural features from sequence, such as α-helical membrane-spanning helices and extra/intracellular domains (e.g. TMHMM [6,7], PHDhtm [8], MEMSAT3 [9]) as well as membrane-spanning β-strands of transmembrane β-barrel proteins (e.g. BOCTOPUS [10]). Furthermore, a major step toward ab initio protein structure prediction has been made through the development of new techniques for mapping energetic interactions in proteins. Lockless and Ranganathan [11] demonstrated that a statistical energy function is a good indicator of thermodynamic coupling in proteins. They also showed that sets of interacting residues form connected pathways in the protein fold, which provides a basis for efficient energy conduction within proteins. Their approach, called statistical coupling analysis (SCA), is the basis for further work in this area. Other approaches predict protein structures from key information that can be obtained from homologous sequences and their evolutionary variation because: "The diversity of biologic phenomena arises from the complexity and specificity of biomolecular interactions. Nucleic acid and protein polymers encode and express biologic information through the specific sequence of polymer units (residues). The sequences and corresponding molecular structures are under selective constraints in evolution [12]".
Due to the growth in available protein sequences, many statistical methods have been developed to compute protein three-dimensional structures from their evolutionary context. Diverse contributions have led to sophisticated methods that identify additional key residues involved in protein structure and function, especially residues that are strongly conserved within each subfamily but differ between subfamilies [13]. Previous works of Marks et al. [14,15] indicate that rich evolutionary information from genomic sequences can be mined efficiently, leading to information on evolutionary couplings between residues. Morcos et al. [16] exploited the strong constraints that the three-dimensional structures of homologous proteins impose on their sequence variability. They developed an efficient direct-coupling analysis (DCA) [17,18] implementation to evaluate the accuracy of contact prediction for a large number of protein domains. Later, Hopf et al. [19] presented a maximum entropy approach to infer evolutionary covariation in pairs of sequence positions of a given protein family. Atom models generated from the derived pairwise distance constraints were finally used to predict the full spectrum of protein structures, functional interactions and evolutionary dynamics of unknown three-dimensional structures for 11 transmembrane proteins. A novel approach by Kamisetty et al. [20] uses an approximation method to predict residue-residue contacts in protein structures more accurately. Compared to previous methods, higher accuracy was achieved by integrating structural context and sequence co-evolution information. Hence, their method allows more accurate contact predictions from fewer homologous sequences. Furthermore, in genome-wide membrane protein sequence analyses, numerous short conserved sequence motifs were identified [21].
These motifs support the understanding of the features that are important for establishing stability and functionality of the folded membrane protein in the membrane environment. Additionally, as addressed in [22], the analysis of sequence motifs in proteins with similar function or structure might help to identify essential functional sites and locations that contribute to structural stability. Thus, sequence motif analysis can be helpful for numerous applications, e.g. the investigation of mutant proteins, the understanding of protein dynamics and the potential effects of mutagens. In the course of evolution, the spatial structure of proteins is generally more strongly conserved than the amino acid sequence. Applied to the field of sequence motif analysis, structure-forming motifs are of general importance in α-helical membrane protein structure formation and interaction mediation [1]. Moreover, hubs and consecutive motifs with high occurrence in certain membrane protein families can be classified as important for family-specific functional characteristics [23]. Finally, the combination of interaction information and sequence motifs with evolutionary variation can be used for three-dimensional structure prediction.
In our work we obtained key information from homologous sequences to separate and predict membrane protein structures in the context of interacting patterns and their evolutionary variation. Patterns, as motif representatives, are investigated with regard to evolutionary covariation. Interaction information helps to detect interacting patterns with an evolutionary background. Here, we report the development of an algorithm that extracts evolutionarily influenced interaction pattern pairs. These were used to investigate the different mutation types that evolution provides to maintain structure and function. In agreement with previous works, we can state that evolution provides basic building blocks to maintain structure and function. Accordingly, family-specific interaction pattern information was used to predict unknown α-helical transmembrane protein structures. We have also tested our method on a structure already predicted in the previous work of Hopf et al. [19]. Finally, our approach is not based on recently developed methods like SCA or DCA; instead, it processes interaction and secondary structure data to predict helix-rich structure parts, which connects it to previous works.

Methods
In the first step, known crystal structures of α-helical membrane proteins were investigated. Structural information was derived from PDBTM [24]. Currently available known α-helical membrane proteins were assigned to their protein families [25] using Pfam mappings. We have tested our method on two selected families with homologous sequences that contribute to generating coupling statistics (Table 1).

Evolutionary co-variations from pattern alignments (PAs)
Hopf et al. hypothesized and confirmed in their work [19] that evolution conserves interactions between residues that are important to maintain structure and function. This is done by constraining the sets of mutations that are accepted at interacting sites. To find these constrained interactions within different sequence patterns, we generated PAs using a novel algorithm that detects evolutionary covariation. Aspects of this algorithm are given in this section. Before describing the application of our algorithm, however, we give a short summary of the general definition of short sequence motifs, as well as of motif detection and information extraction. The next steps therefore concern motif extraction from α-helical structures. As described in previous work [26], a motif can be written in a generalized, regular expression-like form XYn, where X and Y correspond to amino acids separated by n − 1 highly variable positions. Short sequence motifs were extracted that contribute to building the α-helical structure in the transmembrane environment. A naive text search algorithm was applied for motif extraction. More precisely, the algorithm uses a sliding sequence frame strategy. Beginning from the start position of the sequence, different window sizes are used to extract the underlying subsequence. Each subsequence is transcribed into its regular expression XYn. More specifically, for each pair of sequence positions i and i + n the algorithm returns the N-terminal residue X and the C-terminal residue Y. Note that X and Y denote any of the 20 canonical amino acids. Duplicates were removed. It is known that amino acids are positioned with an average of 3.6 residues per turn in TM-helices [27], and it is also known that motifs of different lengths are favoured for TM-helix packing [1,28].
Based on this, the number of variable positions, n − 1, lies in the range 2 ≤ n − 1 ≤ max, where max is the maximum helix length of a protein family. Next, for a given protein, each motif-representative pattern was searched in all helices. If a pattern was found, it was stored as the initial pattern (IP). The IP is the pattern against which all others are aligned. To detect evolutionary covariation and to minimize statistical noise, we aligned patterns from other structures of the same protein family. We ensured that these patterns, called subwords (SWs), have at most one mutated variable position and a length of n_SW ≤ n_IP. To avoid redundancy and to minimize computational processing time, already aligned SWs were ignored. Each PA returns possible evolutionary covariation at the variable positions of the aligned IP. A representative PA example is shown in Figure 1 (Pattern Alignment).
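The sliding-frame extraction and the one-mutation subword test described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the upper bound on n defaults to the helix length minus one, and the subword test is restricted to patterns of equal length, whereas the text also admits shorter SWs.

```python
import re

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 canonical residues


def extract_motifs(helix, max_n=None):
    """Return the duplicate-free set of (X, Y, n) motifs in one helix sequence.

    X is the residue at position i and Y the residue at position i + n,
    so the two are separated by n - 1 variable positions (2 <= n - 1).
    """
    if max_n is None:
        max_n = len(helix) - 1  # assumption: bounded by this helix only
    motifs = set()
    for n in range(3, max_n + 1):
        for i in range(len(helix) - n):
            x, y = helix[i], helix[i + n]
            if x in AMINO_ACIDS and y in AMINO_ACIDS:
                motifs.add((x, y, n))
    return motifs


def find_patterns(helix, x, y, n):
    """Return (start, pattern) for every occurrence of motif XYn in a helix.

    A lookahead is used so that overlapping occurrences are all reported.
    """
    regex = re.compile(f"(?=({x}.{{{n - 1}}}{y}))")
    return [(m.start(), m.group(1)) for m in regex.finditer(helix)]


def variable_mismatches(ip, sw):
    """Return the variable positions at which a subword differs from the IP.

    Returns None if the two patterns cannot be aligned under the simplifying
    assumption of equal length and identical X and Y residues.
    """
    if len(ip) != len(sw) or ip[0] != sw[0] or ip[-1] != sw[-1]:
        return None
    return [k for k in range(1, len(ip) - 1) if ip[k] != sw[k]]
```

Under these assumptions, a subword would be accepted for a pattern alignment when `variable_mismatches` returns a list with at most one entry.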

Specific evolutionary interaction pattern pairs (EIPPs)
To close the information gap regarding which individual patterns interact with each other, we derived interaction data from a known database. Generally, such databases allow rapid and simple access to the required data. Helix-helix interaction information was derived from TMPad, the TransMembrane Protein Helix-Packing Database [29]. TMPad is an integrated repository of experimentally determined structural folds derived from helix-helix interactions in α-helical membrane proteins. It provides geometric descriptors of helix-helix interactions, topology, lipid accessibility, and ligand and binding site information. Currently, 1,107 protein entries, 4,061 protein chains and 17,413 helix-helix interactions are available. Contact information was enriched with Contacts of Structural Units (CSU) [30] from the Weizmann Institute of Science, which provides data from the analysis of inter-atomic contacts of structural units in Protein Data Bank (PDB) [31] entries. This makes it possible to relate structure and helix-helix interaction information to representative patterns of discriminative sequence motifs. After successful integration of the TMPad information to find EIPPs, helix-helix interactions were registered. An interaction pattern pair was extracted when a contact was found only at a variable pattern position. We ensured that at least one pattern of a given pair has mutations at the variable position. To obtain a statistical overview of the most frequently interacting motif pairs, the number of occurrences was recorded for each XYn − XYn pair. EIPPs are specific within the investigated membrane protein family. Such pairs can be considered family-specific signatures because, within the evolutionary space, they are responsible for building and stabilizing the protein's structure. Each EIPP was labelled with the protein in which it was found.
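The pair extraction and the occurrence statistics can be illustrated with a short sketch. It is written under stated assumptions and is not the authors' code: contacts are given as pairs of residue indices (one index per helix), a pair is kept when at least one contact links the two pattern occurrences and every linking contact lies on variable positions of both patterns, and the additional requirement that one pattern carries a mutation is left out. All names are hypothetical, and no TMPad or CSU file format is assumed.

```python
from collections import Counter


def label(pattern):
    """Map a pattern occurrence to its XYn label, e.g. 'ALGA' -> 'AA3'."""
    return f"{pattern[0]}{pattern[-1]}{len(pattern) - 1}"


def variable_positions(start, pattern):
    """Absolute sequence indices of the variable positions of an occurrence."""
    return set(range(start + 1, start + len(pattern) - 1))


def count_interacting_pairs(occ_a, occ_b, contacts):
    """Count XYn-XYn label pairs for two interacting helices.

    occ_a, occ_b: lists of (start, pattern) occurrences on helix A and B.
    contacts: iterable of (i, j) residue index pairs, i on helix A, j on B.
    """
    counts = Counter()
    for start_a, pat_a in occ_a:
        span_a = range(start_a, start_a + len(pat_a))
        var_a = variable_positions(start_a, pat_a)
        for start_b, pat_b in occ_b:
            span_b = range(start_b, start_b + len(pat_b))
            var_b = variable_positions(start_b, pat_b)
            # Contacts that connect these two occurrences at all.
            linking = [(i, j) for i, j in contacts
                       if i in span_a and j in span_b]
            # Keep the pair only if every linking contact is on variable
            # positions, i.e. none touches the anchor residues X or Y.
            if linking and all(i in var_a and j in var_b for i, j in linking):
                counts[f"{label(pat_a)}-{label(pat_b)}"] += 1
    return counts
```

Summing the returned counters over all helix pairs and structures of a family would give the kind of occurrence overview described above.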
Pattern interaction networks were created for final visualization and to support the understanding of how evolution maintains attractive interactions within an EIPP. Furthermore, the existence of family-specific EIPPs was evaluated by a protein separation task. An evaluation dataset of the investigated Pfam families PF01036 and PF00230 was derived (Table 2). Redundancy reduction was performed by requiring the family-specific number of transmembrane helices. Transmembrane helix information was obtained using TMHMM Server v. 2.0 [6,7]. TMHMM predicts intra/extracellular regions and integral membrane helices from sequence. Besides per-residue predictions, TMHMM also lists the underlying per-residue assignment probabilities as an indicator of prediction uncertainty. TMHMM results do not always exhibit the expected number of 7 TM-helices (Bacteriorhodopsin-like protein) and 6 TM-helices (Major Intrinsic Proteins) in the evaluation dataset, which reduces the evaluation dataset. Consequently, not all sequences of the evaluation dataset were included in the process. Known structure representatives were also removed.
Figure 1 The workflow for evolutionary interaction pattern derivation up to final structure similarity determination. A: The main process to derive family-specific EIPP records. This includes the protein data aggregation from known membrane protein structures and the detection of evolutionary covariation based on pattern alignments (PAs). Together with interaction data from TMPad [29], we obtain interacting patterns with evolutionary background, which are important for maintaining structure and function. B: The evaluation process includes obtaining α-helical sequence information from unknown membrane protein structures using TMHMM [6,7]. Finally, signature EIPPs can be searched in unknown structures, followed by structure similarity determination against known structures.
In the next step, protein clusters consisting of all family-representative unknown structures were merged to form a cloud and subsequently sampled. For each cloud member, family-specific EIPPs were applied to the TMHMM-predicted helices, disregarding mutations and allowing different degrees of freedom. Here, a threshold determines the number of approved variable positions within EIPPs. Matches were registered and marked in the respective helices, and the sequence similarity of the resulting interacting ranges to known structures was calculated. In addition, the family specificity of EIPPs leads to family-specific classifiers and thus to the ability to detect the family affiliation of unknown structures that contain mutation-affected homologous sequence parts. It is important to mention that this task is not aimed at developing a new and better approach to classifying proteins than Pfam with its hidden Markov models. We only demonstrate the specificity of mutation-affected interacting sequence parts of a given protein family.
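One possible reading of this separation step is sketched below. The sketch rests on several assumptions that the text does not fix: the threshold is treated as the maximum number of variable positions that may differ from a stored pattern, the anchor residues X and Y must match exactly, both members of an EIPP must be found in two different predicted helices, and a protein is assigned to the family with the most EIPP hits. All names are hypothetical.

```python
def matches(pattern, helix, max_mismatches):
    """Start indices at which a pattern fits a helix.

    The anchors X and Y must match exactly; up to max_mismatches of the
    variable positions in between may differ (the degree of freedom).
    """
    n = len(pattern)
    hits = []
    for i in range(len(helix) - n + 1):
        window = helix[i:i + n]
        if window[0] != pattern[0] or window[-1] != pattern[-1]:
            continue
        mismatches = sum(a != b for a, b in zip(pattern[1:-1], window[1:-1]))
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits


def count_eipp_hits(helices, eipps, max_mismatches):
    """Number of EIPPs whose two members occur in two different helices."""
    hits = 0
    for pat_a, pat_b in eipps:
        in_a = {h for h, seq in enumerate(helices)
                if matches(pat_a, seq, max_mismatches)}
        in_b = {h for h, seq in enumerate(helices)
                if matches(pat_b, seq, max_mismatches)}
        if any(a != b for a in in_a for b in in_b):
            hits += 1
    return hits


def assign_family(helices, family_eipps, max_mismatches=1):
    """Return the family with the most EIPP hits, or None if nothing matches.

    family_eipps maps a family identifier to a list of (pattern, pattern).
    """
    scores = {family: count_eipp_hits(helices, eipps, max_mismatches)
              for family, eipps in family_eipps.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Raising `max_mismatches` in this sketch makes more windows match any given pattern, which is one way to see why additional variable positions make EIPPs less family-specific.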

Results and discussion
EIPPs were derived from known crystal structures of different membrane protein families. PAs provide evolutionarily induced variable positions within EIPPs. As described above, evolutionary covariation was detected in EIPPs. In some cases, aligned SWs with up to one mutated position account for multiple covariations within an EIPP member. Evolution could have been given more leeway by aligning SWs with more than one mutated position, because evolution allows more variance at the variable pattern positions while maintaining structure and function. Our results show that evolution provides basic building blocks that are significant for the transmembrane environment, as described in previous works [1,21,23]. Evolution itself determines the sequence variability and thus the variance of the variable pattern positions. If we consider each EIPP member as a basic building block, we obtain a global view of this interacting sequence part rather than of a single residue. We thereby bypass the analysis of each residue to obtain structurally interacting units. The visualization of the generated pattern interaction networks (Figure 2) shows which pattern pairs of different lengths are generally involved in spatial interaction, taking the evolutionary background into account. We obtain important information about variable pattern positions that can mutate without influencing attractive pattern interactions. Interaction tree schemes may provide better indicators for laboratory mutagen investigations. More specifically, they support the investigation of mutational variants that cause diseases such as nephrogenic diabetes insipidus. Because TMPad information is incomplete, not all position-specific mutations are part of our EIPPs. EIPP-related mutations were collected only if a contact could be detected from TMPad.
Using this tree information, different known structures of PF01036 were analysed for EIPPs. Rhodopsin-like proteins are a major subject of research, and different structure-function studies have been performed on them [32,33]. Further, active core fluctuations, the folding core, the kinetics and the involved residues have been treated extensively in previous studies [34][35][36]. In this work, Bacteriorhodopsin-like protein structures were used to evaluate the derived EIPPs. Representatives of the statistically most frequently interacting motifs were searched. Long motif XYn (n = 9) representative patterns tend to interact more frequently than short ones because of the larger number of possible residue-residue interaction combinations. The examples given in Figure 3 show how different EIPPs fulfil structural tasks and take part in spatial interactions. Specifically, they show how different evolutionary mutation types emerge from EIPPs. These types describe in more detail the sequence variability that has no significant influence on protein structure and function.
These are described in more detail below:
1. Simple residue replacements that are not involved in any interaction. These tend to be an important block within an EIPP member, so the structure can be folded without any task to build important spatial contacts (Figure 3A).
2. Contact-specific mutations within evolutionary patterns. An amino acid that is responsible for building a spatial contact to another helix is replaced by another amino acid without modifying the residue-residue interaction network. This can only be realized using amino acids with properties similar to those of the replaced residue. Here, the length and the spatial orientation play a major role in determining a suitable replacement. As injunctive contact example shown in Figure 3B1:
Morcos et al. [16] explained the simplicity between evolutionary substitutions and residue-residue contacts. "If two residues of a protein or a pair of interacting proteins form a contact, a destabilizing amino acid substitution at one position is expected to be compensated by a substitution of the other position over the evolutionary time-scale, in order for the residue pair to maintain attractive interaction". For in-depth discussions and evaluations see [16]. These results can be seen in our frequently interacting motif pair AL8-LI8, shown in Figure 3C.
right down to contact specific position. Here, common amino acids step in to cope with the complete change. One such amino acid is tryptophan (Trp), which plays an important role in membrane proteins as described in previous work [37].
In the following, a summary of how to use EIPP data for structure prediction is given. As a proof of concept, 116,810 EIPPs (PF01036) and 63,283 EIPPs (PF00230) (Table 3) were extracted from known structures of the corresponding protein families (see Additional file 1). Here, the number of EIPPs is given by interacting patterns of different lengths. These include interaction members with permanently assigned positions and members that are
Here, 372 of 438 (PF01036) and 5,993 of 6,420 (PF00230) representative proteins were correctly assigned to their families when the evolutionary degree of freedom was taken into account. As the number of variable positions increases, EIPPs become less specific for a membrane protein family and more proteins are incorrectly assigned. Misclassified proteins contain no EIPPs in the investigated membrane helices and thus show no sequence similarity, due to heterologous sequence parts. The reason is the restriction to single mutations within aligned SWs, which means that not all positions are considered by our algorithm. Owing to sequence homology, the generated EIPPs are part of currently unknown structures of the investigated protein family. Generally, our classification result shows that unknown structures can be assigned to a membrane protein family
by our described method. Furthermore, registered EIPPs were marked and compared to known structures. As shown in Figures 6 and 7
Moving forward, we discuss the structural similarity results. EIPPs, as interacting structural blocks, are specific within a membrane protein family and for the folding of each TM-helix within a membrane protein. To recover EIPPs in the sequence of an unknown structure, EIPPs must occur in the helix that corresponds to the known structure. For this we had to fall back on TMHMM, a known secondary structure prediction tool. This dependence means that the discussed approach cannot perform better than the best secondary structure prediction tool.
On the other hand, EIPPs provide TM-helical information from known structures. This opens the possibility of refining secondary structure prediction tools, which can be addressed in further work. Finally, our method can be used to improve sequence-based methods for classification and protein homology detection.

Conclusion
In this work, we have demonstrated an approach for extracting short, spatially interacting amino acid subsequences, so-called evolutionary interaction pattern pairs (EIPPs), from known crystal structures of α-helical membrane protein families and the underlying sequence data of protein family members. We have also outlined how EIPPs can be used to predict protein structure. Covariation within motif-representative homologous sequence patterns was detected using a pattern alignment algorithm. In combination with interaction information from TMPad [29], EIPPs were obtained and employed to generate interaction trees. We were thereby able to show how different interacting patterns differ evolutionarily. Moreover, the EIPPs were evaluated using known structures of Bacteriorhodopsin-like proteins and discussed in detail. Different mutation types emerge as an evolutionary instrument for realising sequence variability within a protein family. Furthermore, EIPPs were used to generate family-specific classifiers. For representative proteins with unknown secondary structure, α-helical sequence information was predicted using TMHMM [6,7]. Finally, family-specific protein separation was performed and the structural similarity to known structures of the related protein family was calculated. With regard to structure similarity, our method describes how different interacting patterns with evolutionary background help to establish a protein family affiliation. We are also able to determine which unknown structures are most similar to known structures of a given α-helical membrane protein family. Our results are in good agreement with recently published studies showing that evolution provides basic building and interacting blocks for maintaining structure and function. Due to sequence homology such blocks are repeated, which indicates structural conservation.
Viewing a sequence from the perspective of such blocks facilitates the understanding of how membrane protein structures of a family are constructed. Last but not least, low-cost, rapid computational methods can be developed to support, extend or refine classification and prediction methods for membrane proteins.