A generalized analysis of hydrophobic and loop clusters within globular protein sequences
© Eudes et al; licensee BioMed Central Ltd. 2007
Received: 06 November 2006
Accepted: 08 January 2007
Published: 08 January 2007
Hydrophobic Cluster Analysis (HCA) is an efficient way to compare highly divergent sequences through the implicit secondary structure information directly derived from hydrophobic clusters. However, its efficiency and application are currently limited by the need for user expertise. To help the analysis of HCA plots, we report here the structural preferences of hydrophobic cluster species that are frequently encountered in globular domains of proteins. These species are characterized only by their hydrophobic/non-hydrophobic dichotomy. This analysis has been extended to loop-forming clusters, using an appropriate loop alphabet.
The structural behavior of hydrophobic cluster species, which are typical of protein globular domains, was investigated within banks of experimental structures, considered at different levels of sequence redundancy. The 294 most frequent hydrophobic cluster species were analyzed with regard to their association with the different secondary structures (frequencies of association with secondary structures and secondary structure propensities). Hydrophobic cluster species are predominantly associated with regular secondary structures, and a large part of them (60 %) show preferences for α-helices or β-strands. Moreover, the analysis of the hydrophobic cluster amino acid composition generally allows for finer prediction of the regular secondary structure associated with the considered cluster within a cluster species. We also investigated the behavior of loop-forming clusters, using a "PGDNS" alphabet. These loop clusters do not overlap with hydrophobic clusters and are highly associated with coils. Finally, the structural information contained in the hydrophobic structural words, as deduced from experimental structures, was compared to the PSI-PRED predictions, revealing that β-strands and especially α-helices are generally over-predicted within the limits of typical β and α hydrophobic clusters.
The dictionary of hydrophobic clusters described here can help the HCA user to interpret and compare the HCA plots of globular protein sequences, and provides an original, fundamental insight into the structural bricks of protein folds. Moreover, the novel loop cluster analysis brings additional information for secondary structure prediction on the whole sequence through a generalized cluster analysis (GCA), and not only on regular secondary structures. Such information lays the foundations for developing a new and original tool for secondary structure prediction.
Prediction of secondary structures is a fundamental basis for protein structure prediction. It provides constraints for finding remote homologues with low sequence similarity for comparative modeling and starting points for fold recognition. This information is particularly useful to infer biological function from the expanding sequence data originating from genomes, as the gap between sequences and experimental structures is continuously growing.
Current prediction methods generally extract information from known experimental structures and use it for predicting secondary structures in unknown sequences. A substantial improvement in secondary structure prediction has been made by taking into account the evolutionary information provided by the divergence of protein sequences belonging to the same structural family (e.g. [2, 3]). Predictions now reach an accuracy of around 75–80 % of all residues predicted correctly on the basis of three states (alpha, beta, coil), i.e. the three-state per-residue accuracy. Accuracy limitations may come from inconsistencies between the different secondary structure assignment methods, but also from long-range interactions, which are not considered in current predictive tools [6, 7].
Implicit information about secondary structure can be efficiently considered in sequence comparison by using an original, lexical approach called Hydrophobic Cluster Analysis (HCA) [8, 9]. This information can be directly extracted from the analysis of the primary structure, without necessarily using multiple alignments. Hydrophobic clusters delineated using HCA are indeed statistically centered on the regular secondary structure elements, whatever their nature may be (alpha-helix or beta-strand). The definition of hydrophobic clusters delineated through HCA relies on two parameters: the hydrophobic alphabet and the connectivity distance, which sets the minimal number of non-hydrophobic amino acids separating two different clusters. This HCA connectivity distance originates from the constant curvature of the 1D sequence space into the Euclidean three-dimensional space along a helical path, and from the associated use of a two-dimensional support to represent the protein sequence. The VILFMYW alphabet and the connectivity distance of 4 (corresponding to the α-helix curvature) give the best correspondence between the hydrophobic clusters and regular α or β secondary structures. The VILFMYW alphabet is also supported by the greater propensities of these residues to be included in regular secondary structures than in coils, as well as by their general burying [11–13]. An interesting feature of hydrophobic clusters, due to the use of a connectivity distance constraint, is that they cannot be intertwined, i.e. they cannot include or be included in any other hydrophobic cluster. As a consequence, hydrophobic clusters are considerably better markers of regular secondary structures than simple binary patterns of hydrophobic/non-hydrophobic residues, which do not depend on a connectivity distance.
The power of HCA in revealing the position and often the nature of regular secondary structures from the analysis of a single amino acid sequence makes it an efficient tool for comparing sequences of distantly related proteins, identifying remote relationships and deciphering orphan sequences (e.g. [15–19]; see  for a list of investigations performed by our group). The secondary structure compatibility of the compared sequences can be rapidly estimated, and importantly, the limitations of alignments provided by standard similarity search programs, especially for the handling of indels, can often be overcome. Indeed, HCA does not suffer from the presence of indels, even if they are large (e.g. domain insertion). Of note is the accuracy of secondary structure information that can generally be obtained about orphan sequences, for which no homologue can be identified in databases by standard similarity searches, or sequences having only close homologues.
However, the efficiency of HCA largely depends on user expertise, which has so far hampered its application to large-scale genome analyses.
The general correspondence between hydrophobic clusters, taken as a whole, and regular secondary structures was demonstrated several years ago, but no detailed analysis of their individual structural behaviors and preferences for α or β secondary structures has yet been reported. Here, we describe the frequencies of association with secondary structures and the secondary structure propensities of 294 hydrophobic cluster species, which are defined only by their dichotomy in hydrophobic/non-hydrophobic residues and are frequently observed in protein globular domains. The resulting dictionary can help to interpret the HCA plots of protein sequences and to compare them. The observed secondary structures of hydrophobic cluster species typically associated with α-helices and β-strands were also compared with the predictions made by current tools, such as PSI-PRED. Finally, we also investigated the behavior of loop-forming clusters, using an appropriate loop alphabet with the same connectivity distance as for hydrophobic clusters. Such an investigation may bring additional information for secondary structure prediction on the whole sequence using a generalized cluster analysis (GCA), and not only on regular secondary structures. It also lays the foundations for developing an HCA-based, automatic tool for secondary structure prediction.
Hydrophobic cluster analysis
Definition of hydrophobic clusters
The principles of Hydrophobic Cluster Analysis (HCA) have been previously detailed [8, 9]. Briefly, HCA relies on a helical curvature of the "1D" representation of the amino acid sequence in a space of higher dimensionality (3D). This allows the detection and visualization, through a 2D transposition of the sequence (the HCA plot), of the local hydrophobic compactness (hydrophobic clusters) largely associated with internal faces of regular secondary structures (α-helices and β-strands). The definition of hydrophobic clusters depends on three parameters: i) a representative hydrophobic alphabet, ii) an optimal helical pitch to curve the 1D amino acid sequence space in a constant way, iii) a connectivity distance, depending on the considered helical pitch and corresponding to the number of non-hydrophobic residues separating two hydrophobic clusters defined as distinct. Seven hydrophobic residues (V, I, L, F, M, Y, W) make up the HCA hydrophobic alphabet. These residues are, with cysteine, the most buried [11–13] and are more often associated with regular secondary structures (RSS, α-helices and β-strands) than with coils. The optimal helical pitch (for both α and β RSS), assessed by the best correspondence between RSS and hydrophobic clusters, is the α-helix pitch, with an associated connectivity distance of 4 amino acids (approximately one helix turn).
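The delineation rule described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names `binary_code` and `find_clusters` and the example sequence are hypothetical, and the sketch implements nothing beyond the VILFMYW alphabet and the connectivity distance of 4 (any additional rule of the full HCA procedure that is not described in this section is ignored).

```python
HYDROPHOBIC = set("VILFMYW")  # HCA hydrophobic alphabet quoted in the text


def binary_code(sequence, alphabet=HYDROPHOBIC):
    """Encode a sequence as '1' (residue in the alphabet) or '0' (any other)."""
    return "".join("1" if aa in alphabet else "0" for aa in sequence.upper())


def find_clusters(sequence, alphabet=HYDROPHOBIC, cd=4):
    """Return one (start, end, code) tuple per cluster, with end exclusive.

    Two alphabet residues fall in the same cluster when fewer than `cd`
    other residues separate them (cd is the connectivity distance).
    """
    code = binary_code(sequence, alphabet)
    clusters = []
    start = last = None
    for i, bit in enumerate(code):
        if bit != "1":
            continue
        if start is None:
            start = i                      # first residue of a new cluster
        elif i - last - 1 >= cd:           # gap long enough: close the cluster
            clusters.append((start, last + 1, code[start:last + 1]))
            start = i
        last = i
    if start is not None:                  # close the last open cluster
        clusters.append((start, last + 1, code[start:last + 1]))
    return clusters


# Made-up sequence, used only to exercise the function.
example = "GSLKELAEKLGGDPNSKVYVTVE"
for start, end, code in find_clusters(example):
    print(example[start:end], code)
```

On this made-up sequence the sketch yields two clusters, LKELAEKL (10010001) and VYVTV (11101), separated by a stretch of seven non-hydrophobic residues.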
Redundancy of databases must be reduced to avoid statistical bias. At the same time, working with weakly redundant databases does not allow valuable statistics on a large number of cluster species. For example, just 97 hydrophobic cluster species are represented at least 30 times in our 5 % database (in which sequences do not share more than 5 % identity with any other sequence of the bank), against 150, 250 and 304 in the 25 %, 50 % and 90 % databases, respectively (see Additional file 1). It is however interesting to note that even at very low level of redundancy (5%), the 97 informative hydrophobic cluster species gather a large fraction of the total number of hydrophobic clusters in the bank (73 %, against 76 % and 81 % in the 25 % and 90 % databases, respectively). If the simplest and highly populated hydrophobic cluster species 1(P-code 1) and 11(P-code 3) are omitted from this calculation (these are indeed weakly associated with regular secondary structures – see below), the remaining hydrophobic clusters belonging to informative species (with at least 30 members) totalize 61%, 65 % and 73 % of the total numbers of hydrophobic clusters of the 5 %, 25 % and 90 % databases, respectively. However, the use of banks with higher redundancy extends the set of informative cluster species to higher lengths. Hence, whereas only 15 informative cluster species of length 9 can be exploited at 5 %, 70 and 64 cluster species of length 9 and 10, respectively, are available at 90 % redundancy, and a few cluster species can be found up to length 15 (see Additional file 1). This allows getting statistics for clusters associated not only with β-strands, but also with many α-helices.
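The notion of an "informative" species (at least 30 members) and the fraction of all clusters that such species gather can be illustrated as follows. This is a hedged sketch: the function name `informative_species` and the toy counts are assumptions introduced for illustration, not data from the databases analyzed here.

```python
from collections import Counter


def informative_species(cluster_codes, min_count=30):
    """Return (species -> count for species with >= min_count members,
    percentage of all clusters that belong to those species)."""
    counts = Counter(cluster_codes)
    informative = {code: n for code, n in counts.items() if n >= min_count}
    fraction = 100.0 * sum(informative.values()) / len(cluster_codes)
    return informative, fraction


# Toy list of cluster binary codes (hypothetical counts).
toy = ["11"] * 40 + ["1011"] * 35 + ["10011001"] * 25
species, percent = informative_species(toy)
print(species, percent)
```

In this toy bank, two of the three species reach the threshold of 30 members and together gather 75 % of the clusters.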
A practical way to solve the redundancy problem with regard to cluster species is to select, species by species, the appropriate level of redundancy by observing the evolution of the hydrophobic cluster occurrence within a given species as a function of redundancy. Abrupt jumps in the curves, reflecting the existence of very similar sequences, can easily be visualized (see Additional file 2, which illustrates a large set of clusters (top panel), for which occurrences are normalized relative to the values observed in the 50 % database, taken as a reference, as well as the particular case of two clusters (bottom panel)). The 50 % level was chosen as a reference because it roughly corresponds to the inflection point of the curves reporting, species by species, cluster occurrences as a function of the redundancy level. Moreover, this mid value allows extreme levels of redundancy (5 % and 95 %) to be handled in a similar way. For example, occurrences of hydrophobic clusters corresponding to the P-code 153 (10011001), which are predominantly associated with α-helical structures (see below), grow continuously as a function of redundancy, whereas occurrences of hydrophobic clusters of species 137 (10001001, also predominantly associated with α-helices) suddenly grow faster from 80 % redundancy onwards (see Additional file 2, bottom panel). As a consequence, we selected the 90 % redundancy statistics for cluster species behaving like 153 (88 % of the total number of species), whereas we chose an appropriate lower level of redundancy for the others (no less than 70 %; e.g. 75 % was chosen for species 137). Ten cluster species out of the 304 that were initially considered in the 90 % database were discarded, to reach a final number of 294. The chosen levels are indicated in the table reporting the structural preferences of these 294 hydrophobic clusters, supplied as Additional file 3.
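The pairs of binary patterns and P-codes quoted in the text (e.g. 10011001 and 153, 10001001 and 137) are consistent with reading the binary pattern as a base-2 number. Under that assumption, the conversion in both directions is a one-liner in Python; the helper names `p_code` and `binary_from_p_code` are hypothetical.

```python
def p_code(binary):
    """Read a cluster binary pattern (e.g. '10011001') as a base-2 integer."""
    return int(binary, 2)


def binary_from_p_code(code):
    """Inverse conversion: write a P-code as its binary pattern."""
    return format(code, "b")


for pattern in ("1", "11", "10011001", "10001001"):
    print(pattern, p_code(pattern))
```

Because every cluster begins and ends with a hydrophobic residue (1), the binary pattern has no leading zero and the conversion is unambiguous.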
Association of hydrophobic clusters with secondary structures
Frequencies of association of hydrophobic clusters with secondary structures: comparison of the different methods for secondary structure assignment.
Propensities of hydrophobic clusters for secondary structures: comparison of the different methods for secondary structure assignment.
Two different ways were considered to analyze the association of a hydrophobic cluster species with secondary structures (see Material and Methods). On the one hand, a "raw mean" can be deduced by calculating the strict percentages of H, E and C assignments within the cluster limits (the APC rule, for "all positions considered"). This rule has the advantage of accounting for all cluster positions and, in particular, of revealing "strong" cluster species for which the majority of positions are associated with a defined regular secondary structure (e.g. in Additional file 3: species 15 (1111) and 31 (11111), for which the frequencies of association with β-strands according to the APC rule are 81 and 78, respectively). The APC rule has, however, two disadvantages: i) the signal associated with regular secondary structures tends to be "faded" by the coil signal coming from the cluster borders, as the limits of hydrophobic clusters often do not exactly correspond to those of regular secondary structures; ii) it is impossible to estimate whether there is, on average, a single regular secondary structure associated with the hydrophobic cluster or several. Another way to proceed, which overcomes these limitations, is to consider that if one or more contiguous amino acids within a cluster are assigned H (or E), the entire cluster is assumed to be associated with a helix (or with a strand) (the OPS rule, for "one position is sufficient"). This excludes the very few clusters containing amino acids that are associated with the two regular secondary structures (helix and strand), as well as the more numerous clusters that contain at least two different strings of regular secondary structures separated by coil positions. These peculiar clusters are called multiple (M) and are considered separately. The main artifact that may arise from the OPS rule is a potentially poor coverage of the hydrophobic cluster limits by the regular secondary structures.
We therefore calculated, for each species, the average rates of residues assigned H (α-coverage) or E (β-coverage) within the hydrophobic cluster limits, as described in the Methods section. This coverage is close to a Qi value calculated on the cluster limits, but differs from SOV values, which would take into account the overflowing of the observed regular secondary structures outside the cluster limits. We frequently observed high coverage values (>70 %, Additional file 3), in particular for helices (93.2 % of the observations reported in Additional file 3, versus 45.7 % for the strands; this calculation excludes species 1, for which the coverage value is obviously 100 %, and species for which the considered regular secondary structure is not observed). The resulting mean values are high (83.5 % (α) and 68.2 % (β)), revealing a good coverage of the hydrophobic clusters by regular secondary structures. It is worth noting that coverage values are, on average, lower for strands. For lengths lower than or equal to the mean length of β-strands (6 residues), there is however no difference, on average, between the α-helix and β-strand coverage values associated with the same hydrophobic cluster species. Differences appear for higher lengths: the strand assignments do not cover, on average, all the hydrophobic cluster positions. Relatively low values can also originate from the shift frequently observed between the gravity centers of regular secondary structures and hydrophobic clusters (0.6 and 0.3 residues on average towards the N-terminus for α-helices and β-strands, respectively; mean standard deviations σα = 2.8 and σβ = 2.1, respectively; association with regular secondary structures assigned using the OPS rule). This shift can exacerbate the differences observed between α-helix and β-strand coverage values, which mainly result from their different mean lengths.
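The two rules can be made concrete with a short Python sketch. It assumes that the secondary structure observed within the cluster limits is available as a non-empty three-state string over the letters H, E and C; the function names `apc_percentages` and `ops_state` are hypothetical and the code is an illustration of the rules as worded above, not the implementation used in this study.

```python
from itertools import groupby


def apc_percentages(assignment):
    """APC rule: percentage of H, E and C over all cluster positions."""
    n = len(assignment)
    return {state: 100.0 * assignment.count(state) / n for state in "HEC"}


def ops_state(assignment):
    """OPS rule: return 'H', 'E', 'C' or 'M' (multiple) for one cluster.

    A single run of H (or E) is sufficient to associate the whole cluster
    with a helix (or a strand). Clusters containing both H and E, or two
    runs of regular structure separated by coil, are flagged as multiple.
    """
    runs = [state for state, _ in groupby(assignment) if state in "HE"]
    if not runs:
        return "C"
    if len(runs) == 1:
        return runs[0]
    return "M"


print(apc_percentages("CHHHHHHC"))  # coil at both borders fades the helix signal
print(ops_state("CHHHHHHC"))        # a single helical run
print(ops_state("EEECCEEE"))        # two strands separated by coil
```

The first example shows the "fading" effect mentioned above: the cluster is 75 % H under APC because of its two border positions, whereas OPS assigns it to the helix class as a whole.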
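As an illustration of the coverage measure, the sketch below computes the percentage of cluster positions assigned to a given state and averages it over the clusters of a species in which that state is observed. The Methods section is not reproduced here, so the exact averaging scheme is an assumption, as are the function names `coverage` and `mean_coverage` and the toy assignments.

```python
def coverage(assignment, state):
    """Percentage of cluster positions assigned to `state` ('H' or 'E')."""
    return 100.0 * assignment.count(state) / len(assignment)


def mean_coverage(assignments, state):
    """Mean coverage over the clusters in which `state` is observed.

    Returns None when the state is never observed (assumed convention).
    """
    values = [coverage(a, state) for a in assignments if state in a]
    return sum(values) / len(values) if values else None


# Toy assignments for three clusters of one 8-residue species.
toy = ["CHHHHHHC", "HHHHHHHH", "CCCCCCCC"]
print(mean_coverage(toy, "H"))
```

Here the two helix-associated clusters have coverages of 75 % and 100 %, giving a mean α-coverage of 87.5 %; the all-coil cluster does not enter the average.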
Finally, the correlation coefficients calculated between APC and OPS percentages reported in Additional file 3 indicate that these variables are highly correlated (0.96 (helices) and 0.93 (strands)).
For many hydrophobic clusters, there is a clear preference for a particular regular secondary structure (called the preferred secondary structure of a cluster species). This is illustrated in Additional file 3, which reports the percentages of association of each hydrophobic cluster species with secondary structures (using the APC and OPS rules described above) and the secondary structure propensities, calculated using the OPS rule as described in Material and Methods. These structural preferences are detailed in Figure 4, which illustrates the numbers of hydrophobic cluster species that are preferentially associated with the different secondary structures (with respect to frequencies of association with secondary structures (top) and secondary structure propensities (bottom)). Hence, it can be observed that 60 % of the hydrophobic cluster species analyzed in this study show a preference either for helices or for strands. The secondary structure propensities are also illustrated in Figure 2, in which cluster species are classified according to their occurrence in the 90 % database.
In contrast with the vast majority of hydrophobic clusters, which are mainly associated with regular secondary structures (mainly with helices, mainly with strands, or with either helices or strands), the smallest hydrophobic clusters (1 and 11; P-codes 1 and 3), which contain too few amino acids to constitute stable elements of regular secondary structures, are mainly associated with coils (Additional file 3). Also notable is the relatively low level of association with regular secondary structures of the basic hydrophobic clusters V (11; P-code 3), M (101; P-code 5), U (1001; P-code 9) and D (10001; P-code 17), on the basis of which all cluster species can be built. These hydrophobic clusters are also not sufficiently rich in hydrophobic residues to constitute, by themselves, stable regular secondary structures. In the dictionary presented in Additional file 3, the total frequency of hydrophobic clusters associated with coils, following the OPS rule, is 28.1 % (the small clusters, which are preferentially associated with coils, are also highly populated), whereas this frequency falls to only 6.5 % when hydrophobic clusters with P-codes 1, 3, 5, 9 and 17 are omitted from the calculation.
It is important to note that the 2D shapes on the HCA plot of hydrophobic clusters associated with α-helices are almost identical to their actual 3D counterparts within the protein architecture. Indeed, the standard support used by HCA to curve the 1D space is the α-helix (connectivity distance (CD) 4), and thus there is a direct correspondence between 2D and 3D hydrophobic clusters. The hydrophobic cluster with P-code 153, shown in Figure 5, illustrates this feature. For clusters associated with β-strands (e.g. P-code 45 in Figure 5), there is also quite good shape conservation between hydrophobic clusters on the 2D HCA plot and on the actual 3D structure. This is because the extended structure is mathematically a 2D degenerate helix with a connectivity distance of 2 (CD 2). Therefore, the HCA transposition offers at a glance the actual or nearly actual shape of the internal faces of α and β regular local structures.
The secondary structure percentages and propensities described in Additional file 3 may thus provide a direct and simple way to predict the likely secondary structure associated with a hydrophobic cluster within a single sequence. Moreover, considering the chemical nature of amino acids belonging to the hydrophobic clusters (as well as to their neighborhoods) generally allows the refinement of secondary structure prediction associated with the cluster species. Indeed, differences in the global amino acid composition can be observed between the preferred secondary structure state (helix or strand) and the other state associated with each cluster species, as exemplified in Figure 3, which illustrates two representative clusters mainly associated with alpha helices and beta-strands, respectively. Indeed, a clear preference for leucine and alanine is observed for α-helix configuration (preferred secondary structure state) of cluster species 153 (10011001), whereas glycine and cysteine are more frequent in the β-strand configuration. Similar trends are observed for the β-strand-associated species 29 (11101), for which leucine is preferred in the α-helix configuration, whereas valine and isoleucine are more frequent in the β-strand configuration (preferred secondary structure state).
"Multiple clusters", which are associated with at least two regular secondary structures, are frequently encountered within some hydrophobic cluster species. They are logically observed in long clusters (e.g. cluster species 1577 (11000101001), 64 % multiple (multiple propensity of 3.91)) but are also detected in clusters of relatively small sizes (e.g. cluster species 199 (11000111) in Figure 5 (bottom panel), or 325 (101000101), 43 % and 38 % multiple, respectively). The regular secondary structures covered by the cluster limits are generally separated by short loops or can even be contiguous (e.g. cluster 199 in Figure 5 (bottom panel)). As a rule of thumb, multiple clusters most often include a stretch of three contiguous non-hydrophobic residues (0), indicating the presence of a short loop. The loops included in larger multiple clusters generally contain an isolated hydrophobic residue, which makes the bridge between the different parts of the hydrophobic cluster matching the regular secondary structures. An example of such a scenario is observed in Figure 5 (top panel) for cluster 1010100010001001111(P-code 345167, not included in Additional file 3 due to a too low occurrence). For clusters of sufficient length, changes in the periodicity in hydrophobic/non-hydrophobic residues are generally indicative of the coverage of several regular secondary structures. These changes (or non-homogeneity) can reveal the transition from a strand to a helix (from a periodicity in hydrophobic residues of 2 to a periodicity of 3/4) or between two strands (two high densities in hydrophobic residues separated by a region containing less hydrophobic residues (cluster 345167 1010100010001001111 in Figure 5 (top panel)).
Finally, one can also note that symmetric hydrophobic clusters generally possess similar structural behaviors (see for example the symmetric clusters of length 6 : 35 (100011; 39 % α)/49 (110001; 41 % α); 37 (100101; 46 % β)/41 (101001; 46 % β); 39 (100111; 55 % β)/57 (111001; 55 % β); 43 (101011; 69 % β)/53 (110101; 69 % β); 55 (110111; 49 % β)/59 (111011; 58 % β); 47 (101111; 83 % β)/61 (111101; 87 % β)).
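The symmetric partner of a cluster species is obtained by reading its binary pattern backwards. Assuming, as the quoted pairs suggest, that the P-code is the binary pattern read as a base-2 number, the partner can be computed directly from the P-code; the helper name `mirror_p_code` is hypothetical.

```python
def mirror_p_code(code):
    """Return the P-code of the cluster read in the opposite direction."""
    pattern = format(code, "b")      # e.g. 35 -> '100011'
    return int(pattern[::-1], 2)     # '110001' -> 49


# The six pairs of length-6 symmetric clusters listed in the text.
for code in (35, 37, 39, 43, 55, 47):
    print(code, mirror_p_code(code))
```

A palindromic pattern such as 101101 (P-code 45) is its own partner.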
Loop cluster analysis
Definition of loop clusters
Hydrophobic cluster analysis relies on the marked propensities of hydrophobic residues (VILFMYW) to constitute the internal faces of regular secondary structures. Conversely, a second group of amino acids, made up of P, G, D, N and S, are clear markers of loops (or coils), as their propensities are higher for loops than for the two regular secondary structures.
Thus, we aimed at investigating "loop" clusters that are formed by the five residues P, G, D, N and S, using the standard connectivity distance of 4. Loop clusters are defined in the same way as hydrophobic clusters, with PGDNS coded by "1" and any other amino acid coded by "0" (Figure 1). Any loop cluster begins and ends with "1", and no stretch of more than 3 consecutive "0" can be found within its limits. In order to avoid overlap with regular secondary structures (centered on hydrophobic clusters), we removed from the primitive PGDNS clusters (gray in Figure 1) the positions belonging to any hydrophobic clusters included in them. However, in this context, we did not consider the hydrophobic clusters 1 and 11, which are poorly associated with regular secondary structures (see above). The resulting clusters (colored blue in Figure 1) mostly cover coil positions and were named "loop" clusters. Thus, hydrophobic clusters and loop clusters are not intertwined. Hence, in the example shown in Figure 1 and using the rule described above, the first defined PGDNS cluster DQRNTLDLI APSPAD (100100100011101), in which the hydrophobic cluster is underlined, is restricted to the DQRN (loop cluster 1001) and PSPAD (loop cluster 11101) sequences, with each loop cluster beginning and ending with a "1". The hydrophobic cluster 1 (M at the end of the sequence, Figure 1) is included in the loop cluster SGSMD (11101). Mean APC coil values were calculated for the PGDN(S) clusters and for the loop clusters present in the 90 % redundancy database, using the PGDNS alphabet as well as a reduced PGDN alphabet. The coil preference is maximal for loop clusters (67.0 % coil), relative to PGDNS clusters (51.9 % coil), thus supporting the subtraction of the hydrophobic information. Mean APC coil values are also higher for loop clusters when using the PGDNS alphabet (67.0 % coil) instead of the reduced PGDN alphabet (64.7 % coil).
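The subtraction rule can be sketched as follows: primitive PGDNS clusters are delineated with the same connectivity distance as hydrophobic clusters, positions covered by hydrophobic clusters other than 1 and 11 are removed, and each remaining piece is trimmed so that it begins and ends with a PGDNS residue. This is a minimal reading of the rule as worded above, applied to the Figure 1 fragment taken in isolation (written without the space that marks the end of the underlined part); the function names `find_spans` and `loop_clusters` are hypothetical.

```python
HYDROPHOBIC = set("VILFMYW")
LOOP = set("PGDNS")


def find_spans(sequence, alphabet, cd=4):
    """Return (start, end) spans, end exclusive, of the clusters of `alphabet`."""
    spans = []
    start = last = None
    for i, aa in enumerate(sequence):
        if aa not in alphabet:
            continue
        if start is None:
            start = i
        elif i - last - 1 >= cd:     # gap of at least cd other residues
            spans.append((start, last + 1))
            start = i
        last = i
    if start is not None:
        spans.append((start, last + 1))
    return spans


def loop_clusters(sequence, cd=4):
    """Return loop clusters: PGDNS clusters minus hydrophobic clusters."""
    sequence = sequence.upper()
    masked = set()                   # positions owned by hydrophobic clusters
    for start, end in find_spans(sequence, HYDROPHOBIC, cd):
        code = "".join("1" if aa in HYDROPHOBIC else "0"
                       for aa in sequence[start:end])
        if code not in ("1", "11"):  # clusters 1 and 11 are not subtracted
            masked.update(range(start, end))
    loops = []
    for start, end in find_spans(sequence, LOOP, cd):
        i = start
        while i < end:
            if i in masked:
                i += 1
                continue
            j = i
            while j < end and j not in masked:
                j += 1
            # Trim the piece so that it begins and ends with a PGDNS residue.
            marked = [k for k in range(i, j) if sequence[k] in LOOP]
            if marked:
                loops.append(sequence[marked[0]:marked[-1] + 1])
            i = j
    return loops


print(loop_clusters("DQRNTLDLIAPSPAD"))
print(loop_clusters("SGSMD"))
```

On the isolated fragment, the hydrophobic cluster LDLI (1011) is subtracted and the two loop clusters DQRN and PSPAD remain, as in the example above; in SGSMD, the single M forms hydrophobic cluster 1 and therefore stays inside the loop cluster.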
Association with coil structures
Only the APC rule was used here to appreciate the general correspondence between loop clusters and observed secondary structures, as we only considered the global coil percentage associated to each loop cluster species. Indeed, the OPS rule, if used, would have associated the 100 % value with nearly all loop clusters. Results with the consensus assignment are detailed below. However, similar results were obtained using other assignment methods (data not shown).
Just as hydrophobic clusters show a preference for regular secondary structures, loop clusters show a clear preference for coil structures (Additional file 5 and Figure 6). For a given length, the coil frequencies increase with the number of "1" (P, G, D, N or S), the highest ones being observed for "1"-rich loop clusters. This behavior contrasts with that observed for hydrophobic clusters, in which a proper balance between "1" and "0" (not too few and not too many) is generally observed. For loop clusters of moderate length (up to 7), the frequencies of association with β-strands are relatively constant (~10 %), whereas the α-helix frequencies vary between 10 and 30 %. This observation illustrates the frequent overflowing of regular secondary structures (especially α-helices) outside of the hydrophobic cluster limits, within the loop cluster borders. Of note is the higher participation of loop clusters exclusively composed of "1" in regions of experimental structures that lack observable electron density and are labeled as "coils" by current assignment methods (loop cluster 127 (1111111) in Figure 6).
An overall gain in coil frequencies was observed when loop clusters were considered rather than PGDNS clusters, that is, when the information provided by hydrophobic clusters was omitted from the PGDNS clusters (normalized difference above 0 in Additional file 6). This gain is particularly marked for clusters rich in "0", which can include hydrophobic residues that are part of hydrophobic clusters and of the associated regular secondary structures.
Comparison of secondary structure assignments and PSI-PRED predictions within the hydrophobic cluster limits
Hydrophobic Cluster Analysis (HCA) is generally used in an empirical way to combine secondary structure information with analysis of the primary structure. Indeed, it provides direct, accurate statistical access to the gravity centers of regular secondary structures through hydrophobic clusters [10, 14]. HCA contrasts with other predictive approaches based on hydrophobicity and the use of binary patterns (e.g. [25–27]) by providing additional, topological information through the connectivity distance associated with the use of a 2D representation. This information allows a shift from a literal analysis to a lexical one. The so-defined hydrophobic clusters are non-intertwined (they can neither include nor be included in other hydrophobic clusters), and thus correspond to words, which are structurally relevant as they match the positions of regular secondary structures. However, apart from the earlier description of the general correspondence between hydrophobic clusters and regular secondary structures, independent of their nature (helices or strands), no detailed analysis has yet been published with respect to the individual secondary structure preferences of each hydrophobic cluster species.
More generally, this analysis of amino acid composition would also be critical for accurately predicting the nature of the secondary structures associated with the different hydrophobic cluster species. Amino acid profiles, calculated for each of the two regular secondary structures (α and β) associated with each cluster species, should allow the refinement of the secondary structure prediction. This however requires a better solution to the problem raised by multiple clusters, that is, clusters which are associated with at least two regular secondary structures (see below).
Information about amino acid composition would also be useful for refining the analysis of truncated hydrophobic clusters that only cover a limited part of the associated regular secondary structure. Examples of "truncated" hydrophobic clusters can be found in Figure 1 (P-code 403) and Figure 10 (P-code 19). Amino acids in the vicinity of these hydrophobic clusters, such as A, C and T that can substitute for strong hydrophobic residues, might indicate the overflowing of the hydrophobic cluster by the associated regular secondary structure.
Within families of proteins, hydrophobic clusters rapidly evolve relative to sequence divergence, around a stable core of "topohydrophobic" residues (hydrophobic residues which occupy, in a multiple alignment, positions that are always substituted by hydrophobic amino acids and constitute the core of globular domains) [36, 37]. In this context, an interesting perspective to help the HCA-based comparison of divergent sequences is to analyze the cluster substitution schemes within families of sequences, around the invariable kernel of topohydrophobic residues. The so-defined cluster substitution matrices might thus constitute sensitive tools to identify structural conservation at low levels of sequence identity. Alternatively, "canonical" clusters, which show clear preferences for α or β secondary structures, can orientate in a recursive way the prediction for other less typical clusters, with which they are aligned within a family of proteins.
Notably, characteristics deduced from the analysis of hydrophobic cluster species appear quite stable relative to divergence, as illustrated in this study by several variables analyzed at different levels of redundancy (secondary structure association frequencies, cluster propensities, amino acid compositions). The same behavior is observed with loop clusters. As a consequence, high redundancy databases can be exploited in order to obtain a large number of statistically useful clusters, provided that artifacts are eliminated by cluster species (see Results section). Information can be gained about the majority of strands and some helices (cluster lengths up to 13 residues). This size limitation will be progressively overcome by the continuous increase in experimental three-dimensional structures reported in the Protein Data Bank, even when new entries are related to already known structures, since the analysis is not restricted to low redundancy levels. A specific handling of long hydrophobic clusters might also allow the separation of long clusters, which are often "multiple" (that is, covering at least two regular secondary structures but with loops shorter than the considered connectivity distance), into their single components and, as a consequence, make them available for prediction, as suggested by recent results. This procedure would consider clusters of PGDNS residues included in hydrophobic clusters (thus opposite to loop clusters), which often mark the presence of secondary structure separators within these multiple clusters. At the same time, the HCA formalism can be adjusted to the context to improve predictions. Depending on cluster species, different helical pitches (i.e. connectivity distance or CD) can indeed be considered to optimize the discrimination power.
For some clusters, an isolated hydrophobic residue located at the N- or C-terminus may artificially lengthen the cluster, leading to a biased prediction of the secondary structure limits (e.g. in the sequence DPKKINTRFLLY TNENQ, the first two positions of the hydrophobic cluster, which is underlined, are assigned as coil). As a consequence, a lower connectivity distance (CD) would be more appropriate for such clusters. Accordingly, one can observe, for example, that the hydrophobic cluster species 10011001 has a slightly higher discrimination power with a CD of 3 rather than 4. In contrast, 110101 shows an optimum for a CD of 5. Systematically considering a lower CD, however, leads to an artifactual increase in the number of short helix-associated clusters, and to an overall decrease of the structural two-state (regular secondary structure/coil) overlap between hydrophobic clusters and regular secondary structures. On the other hand, a higher CD generally leads to an artifactual increase in the number of multiple clusters. A CD of 4 appears to be the best compromise between discrimination power and artifact minimization. Moreover, symmetric clusters (e.g. 1011 and 1101) typically show similar behavior relative to association with secondary structures, as shown in the Results. Thus, for sparsely populated clusters, the addition of symmetric cluster data may be an interesting way to expand the set of useful clusters, in particular for high cluster lengths.
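The effect of the connectivity distance on cluster delineation can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in this section: the hydrophobic alphabet is taken as V, I, L, F, M, Y, W, proline is treated as a cluster breaker, and two hydrophobic residues are joined when fewer than CD non-hydrophobic residues separate them. The function name `hydrophobic_clusters` is hypothetical, and the example sequence is the one quoted above with its internal space removed.

```python
# Assumed HCA conventions (not specified in this section):
HYDROPHOBIC = set("VILFMYW")  # assumed strong hydrophobic alphabet
BREAKER = "P"                 # assumed cluster breaker


def hydrophobic_clusters(sequence, cd=4):
    """Return (start, end, binary_code) for each hydrophobic cluster.

    Two hydrophobic residues belong to the same cluster when fewer than
    `cd` non-hydrophobic residues (and no breaker) lie between them.
    Positions are 0-based and inclusive.
    """
    spans = []
    start = prev = None
    for i, aa in enumerate(sequence):
        if aa == BREAKER:
            # A breaker always closes the current cluster.
            if start is not None:
                spans.append((start, prev))
            start = prev = None
        elif aa in HYDROPHOBIC:
            # Close the cluster if the gap since the last hydrophobic
            # residue reaches the connectivity distance.
            if start is not None and i - prev - 1 >= cd:
                spans.append((start, prev))
                start = None
            if start is None:
                start = i
            prev = i
    if start is not None:
        spans.append((start, prev))
    return [
        (s, e, "".join("1" if a in HYDROPHOBIC else "0" for a in sequence[s:e + 1]))
        for s, e in spans
    ]


seq = "DPKKINTRFLLYTNENQ"
print(hydrophobic_clusters(seq, cd=4))
print(hydrophobic_clusters(seq, cd=3))
```

Under these assumptions, a CD of 4 joins the isolated N-terminal isoleucine to the FLLY stretch, whereas a CD of 3 separates them, which is the kind of behavior discussed in the text. The code of a symmetric cluster is simply the reversed string (`code[::-1]`).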
In combination with loop clusters, which were analyzed here for the first time, hydrophobic clusters may provide useful and accurate information about the secondary structures of a large part of globular protein domains, from the knowledge of a single sequence alone. The dictionary of hydrophobic and loop clusters presented here may help the user to apply such a methodology. The analysis deduced from single sequences could of course be improved by considering multiple alignments of sequences belonging to the same family, when possible. Indeed, some hydrophobic clusters, which are not very informative or for which statistics are lacking, can be substituted in other sequences of the family by much more informative clusters, allowing the HCA-based prediction to be refined. HCA can thus be used in an iterative and synergistic way with other efficient and automatic methods, in order to improve prediction by taking multiple alignments into account. Finally, the general statistics described here, relative to the structural preferences of hydrophobic and loop clusters and to their structure-dependent amino acid composition, could be used to design a predictive tool, which might be integrated in automatic procedures for comparing highly divergent sequences.
Materials and methods
We considered the SCOP database (version 1.69, July 2005) [38, 39], from which we selected proteins of the first five classes (all alpha, all beta, alpha/beta, alpha+beta, multidomain). From this set (23571 PDB files, corresponding to 49068 protein chains), we discarded protein chains that are also reported in other SCOP classes. We only kept X-ray structures, and discarded protein models and obsolete entries, following the classification available from the Research Collaboratory for Structural Bioinformatics (RCSB) database, as well as files containing only Cα coordinates. Our final database contains 46228 protein chains. We then used the PISCES server to cull protein chains from this list by sequence identity (ranging from 5% to 95% identity, in steps of 5%). The different databases obtained in this way contain from 1815 (5%) to 7015 (95%) protein chains. The databases specifically used in this study contain 2925 (25%) and 6663 (90%) protein chains.
Secondary structure assignments
We used different methods for assigning secondary structures from atomic coordinates, including DSSP, STRIDE and PSEA. Secondary structures defined in the PDB files (SS_PDB) were also considered. A consensus assignment was deduced from these four methods. The secondary structure states produced as outputs by DSSP (8 states) and STRIDE (7 states) were first reduced to three states (helix (H), strand (E) and default coil (C)) using the "EVA" conversion rule [3, 42]. In this scheme, α-helix (H), 3₁₀ helix (G) and π-helix (I) are converted to H; extended (E) and isolated β-bridge (B) are converted to E; and turn (T), bend (S) and all other states are converted to C. The consensus assignment was then obtained by taking the most frequent secondary structure state among the DSSP, STRIDE, PSEA and SS_PDB assignments. If two secondary structure states arise with the same frequency, the regular secondary structure is preferred over coil. If a helix assignment competes with a strand assignment, the corresponding position is assigned as coil.
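The reduction and consensus rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four assignments are assumed to be already aligned, equal-length strings of one-letter state codes, and the function names (`reduce_state`, `consensus_state`, `consensus_assignment`) are hypothetical.

```python
from collections import Counter

# Three-state reduction described in the text: H, G, I -> H; E, B -> E;
# every other state (T, S, ...) -> C.
EVA_REDUCTION = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_state(state):
    """Reduce a DSSP/STRIDE-like state code to H, E or C."""
    return EVA_REDUCTION.get(state, "C")


def consensus_state(states):
    """Consensus for one position from the states given by each method."""
    counts = Counter(reduce_state(s) for s in states)
    best = max(counts.values())
    top = {s for s, n in counts.items() if n == best}
    if len(top) == 1:
        return next(iter(top))   # a single most frequent state
    if "H" in top and "E" in top:
        return "C"               # helix competing with strand -> coil
    return (top - {"C"}).pop()   # regular structure preferred over coil


def consensus_assignment(assignments):
    """Position-wise consensus over aligned, equal-length assignment strings."""
    return "".join(consensus_state(column) for column in zip(*assignments))


# Hypothetical assignments from four methods for a four-residue stretch.
print(consensus_assignment(["HHEC", "GHEC", "TCBC", "SCET"]))
```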
Association of clusters with secondary structures
Two rules can be defined to measure the association of hydrophobic clusters with secondary structures. The APC rule (all positions considered) takes into account the assignments linked to all positions of a given cluster species and can be calculated as:
$$\frac{\sum n_s}{N_X \times L_X}$$
where $s$ is the secondary structure assignment (H, E or C), $n_s$ is the number of "s" assignments in a hydrophobic cluster of species X (summed over all clusters of the species), $N_X$ is the number of occurrences of hydrophobic clusters within the species X and $L_X$ is the length of the hydrophobic cluster species X.
Another way to evaluate the association of hydrophobic clusters with secondary structures relies on the OPS rule (one position is sufficient). According to this rule, the entire cluster is assumed to be associated with a helix or with a strand if, within this cluster, one or several amino acids are assigned to the helix or strand state, respectively. Clusters which contain both α and β assignments, or which contain two stretches of regular secondary structure separated by coil positions, are considered separately and are called multiple. The frequency of association with the helix, strand or multiple states with respect to the OPS rule can be calculated as follows:
$$\frac{n_{Xs}}{N_X}$$
where $n_{Xs}$ is the number of hydrophobic clusters assigned as "s" following the OPS rule and $N_X$ is the total number of hydrophobic clusters within the species X.
The coverage, by the corresponding secondary structure, of the hydrophobic cluster species assigned to the helix or strand state following the OPS rule is calculated as follows:
$$\frac{\sum m_s}{n_{Xs} \times L_X}$$
where $m_s$ is the number of "s" assignments in a hydrophobic cluster of species X assigned as "s" by the OPS rule (summed over all such clusters), $n_{Xs}$ is the number of hydrophobic clusters of species X assigned as "s" following the OPS rule, and $L_X$ is the length of the hydrophobic cluster of species X.
These values were calculated for all redundancy levels (by steps of 5 %).
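As an illustration, the three quantities defined above (APC frequencies, OPS frequencies and OPS coverage) can be computed as in the following minimal Python sketch. It assumes that every cluster of a given species is represented by its three-state consensus string (one character among H, E, C per position), all strings having the length of the species. The function names and the toy data are hypothetical, and the sketch also reports the share of clusters left without any regular assignment under the label C.

```python
import re
from collections import Counter


def apc_frequencies(cluster_states):
    """APC rule: fraction of all positions of the species found in each state."""
    n_clusters = len(cluster_states)   # number of clusters of the species
    length = len(cluster_states[0])    # length of the cluster species
    counts = Counter("".join(cluster_states))
    return {s: counts[s] / (n_clusters * length) for s in "HEC"}


def ops_state(states):
    """OPS rule for one cluster: 'H', 'E', 'M' (multiple) or 'C'."""
    runs = re.findall(r"H+|E+", states)  # stretches of regular structure
    if not runs:
        return "C"
    if len(runs) > 1:
        # Either both helix and strand, or two stretches separated by coil.
        return "M"
    return runs[0][0]


def ops_frequencies(cluster_states):
    """OPS rule: fraction of clusters of the species assigned to each state."""
    labels = Counter(ops_state(c) for c in cluster_states)
    return {s: labels[s] / len(cluster_states) for s in "HEMC"}


def ops_coverage(cluster_states, state):
    """Fraction of positions in `state` within clusters assigned to `state`."""
    selected = [c for c in cluster_states if ops_state(c) == state]
    if not selected:
        return 0.0
    length = len(selected[0])
    return sum(c.count(state) for c in selected) / (len(selected) * length)


# Hypothetical consensus strings for four clusters of one 4-residue species.
clusters = ["HHHH", "CHHC", "EEEC", "HCEE"]
print(apc_frequencies(clusters))
print(ops_frequencies(clusters))
print(ops_coverage(clusters, "H"), ops_coverage(clusters, "E"))
```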
Propensities of hydrophobic clusters for a secondary structure state (C, E, H or M) were calculated in the same way as amino acid propensities. Cluster propensities are calculated as follows:
$${}^{s}P_X = {}^{s}F_X / F_X,$$
where ${}^{s}F_X = n_{Xs}/n_s$ (${}^{s}F_X$ is the frequency of the hydrophobic cluster species X in the state "s", $n_{Xs}$ is the number of hydrophobic clusters of species X in the state "s" (OPS rule), and $n_s$ is the total number of hydrophobic clusters in the state "s" (OPS rule)) and $F_X = n_X/N$ ($F_X$ is the frequency of the hydrophobic cluster species X, $n_X$ is the total number of hydrophobic clusters of species X and $N$ is the total number of hydrophobic clusters).
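The propensity defined above amounts to a ratio of two frequencies, and can be sketched in Python as follows. The input layout (a mapping from cluster species code to the number of clusters assigned to each state by the OPS rule), the function name `cluster_propensities` and the counts are hypothetical; the sketch also assumes that every listed count is non-zero.

```python
from collections import Counter


def cluster_propensities(ops_counts):
    """Propensity of each cluster species for each state.

    `ops_counts` maps a species code to {state: number of clusters of that
    species assigned to the state by the OPS rule}.
    """
    total = sum(sum(c.values()) for c in ops_counts.values())  # all clusters
    per_state = Counter()
    for counts in ops_counts.values():
        per_state.update(counts)       # clusters per state, all species
    propensities = {}
    for species, counts in ops_counts.items():
        f_x = sum(counts.values()) / total  # frequency of the species
        propensities[species] = {
            state: (n / per_state[state]) / f_x
            for state, n in counts.items()
        }
    return propensities


# Hypothetical counts for two symmetric species.
counts = {"1011": {"H": 10, "E": 30}, "1101": {"H": 40, "E": 20}}
print(cluster_propensities(counts))
```

A value above 1 indicates that the species is over-represented in the state relative to its overall frequency, and a value below 1 that it is under-represented.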
Secondary structure predictions
Secondary structures of the different sequences reported in our structure databases were predicted using PSI-PRED.
HCA plots were drawn using the DrawHCA server.
Acknowledgements
This work was partly supported by the European Commission (NoE "3D-EM" contract n° LSHG-CT-2004-502828), by the Commissariat à l'Energie Atomique (LRC-CEA n° 27V), by the Institut National du Cancer and by the association Vaincre La Mucoviscidose. We thank Antonio C. Bianco and Alain Soyer for critical reading of the manuscript.
References
- Uehara K, Kawabata T, Go N: Filtering remote homologues using predicted structural information. Prot Eng Des Sel 2004, 17: 565–570. 10.1093/protein/gzh065
- McGuffin LJ, Jones DT: Benchmarking secondary structure prediction for fold recognition. Proteins 2003, 52: 166–175. 10.1002/prot.10408
- Rost B, Eyrich VA: EVA: large-scale analysis of secondary structure prediction. Proteins 2001, Suppl 5: 192–199. 10.1002/prot.10051
- Rost B: Review: protein secondary structure prediction continues to rise. J Struct Biol 2001, 134: 204–218. 10.1006/jsbi.2001.4336
- Colloc'h N, Etchebest C, Thoreau E, Henrissat B, Mornon JP: Comparison of three algorithms for the assignment of secondary structure in proteins: the advantages of a consensus assignment. Protein Eng 1993, 6: 377–382. 10.1093/protein/6.4.377
- Kihara D: The effect of long-range interactions on the secondary structure formation of proteins. Protein Sci 2005, 14: 1955–1963. 10.1110/ps.051479505
- Tsai C-H, Nussinov R: The implications of higher (or lower) success in secondary structure prediction of chain fragments. Protein Sci 2005, 14: 1943–1944. 10.1110/ps.051581805
- Gaboriaud C, Bissery V, Benchetrit T, Mornon JP: Hydrophobic cluster analysis: an efficient new way to compare and analyse amino acid sequences. FEBS Lett 1987, 224(1):149–155. 10.1016/0014-5793(87)80439-8
- Callebaut I, Labesse G, Durand P, Poupon A, Canard L, Chomilier J, Henrissat B, Mornon JP: Deciphering protein sequence information through hydrophobic cluster analysis (HCA): current status and perspectives. Cell Mol Life Sci 1997, 53: 621–645. 10.1007/s000180050082
- Woodcock S, Mornon JP, Henrissat B: Detection of secondary structure elements in proteins by hydrophobic cluster analysis. Protein Eng 1992, 5: 629–635. 10.1093/protein/5.7.629
- Soyer A, Chomilier J, Mornon JP, Jullien R, Sadoc JF: Voronoi tessellation reveals the condensed matter character of folded proteins. Phys Rev Lett 2000, 85: 3532–3535. 10.1103/PhysRevLett.85.3532
- Pintar A, Carugo O, Pongor S: Atom depth in protein structure and function. Trends Biochem Sci 2003, 28: 593–597. 10.1016/j.tibs.2003.09.004
- Pintar A, Carugo O, Pongor S: Atom depth as a descriptor of the protein interior. Biophys J 2003, 84: 2553–2561.
- Hennetin J, Le Tuan K, Canard L, Colloc'h N, Mornon JP, Callebaut I: Non-intertwined binary patterns of hydrophobic/nonhydrophobic amino acids are considerably better markers of regular secondary structures than nonconstrained patterns. Proteins 2003, 51: 236–244. 10.1002/prot.10355
- Callebaut I, Malivert L, Fischer A, Mornon JP, Revy P, de Villartay JP: Cernunnos interacts with the XRCC4/DNA-ligase IV complex and is homologous to the yeast nonhomologous end-joining factor NEJ1. J Biol Chem 2006, 281: 13857–13860. 10.1074/jbc.C500473200
- Callebaut I, Prat K, Meurice E, Mornon JP, Tomavo S: Prediction of the general transcription factors associated with RNA polymerase II in Plasmodium falciparum: conserved features and differences relative to other eukaryotes. BMC Genomics 2005, 6: 100. 10.1186/1471-2164-6-100
- Callebaut I, Curcio-Morelli C, Mornon JP, Gereben B, Buettner C, Huang S, Castro B, Fonseca TL, Harney JW, Larsen PR, Bianco AC: The iodothyronine selenodeiodinases are thioredoxin-fold family proteins containing a glycoside hydrolase clan GH-A-like structure. J Biol Chem 2003, 278: 36887–36896. 10.1074/jbc.M305725200
- Callebaut I, Moshous D, Mornon JP, de Villartay JP: Metallo-beta-lactamase fold within nucleic acids processing enzymes: the beta-CASP family. Nucleic Acids Res 2002, 30: 3592–3601. 10.1093/nar/gkf470
- Callebaut I, de Gunzburg J, Goud B, Mornon JP: RUN domains: a new family of domains involved in Ras-like GTPase signaling. Trends Biochem Sci 2001, 26: 79–83. 10.1016/S0968-0004(00)01730-8
- Publications of the "Protein sequences and folds" group at IMPMC [http://bioserv.impmc.jussieu.fr/publications.html]
- Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999, 292: 195–202. 10.1006/jmbi.1999.3091
- Cluster code converter [http://bioserv.impmc.jussieu.fr/converter/index.html]
- Structural preferences of the 294 more frequent hydrophobic clusters [http://bioserv.impmc.jussieu.fr/HCA-table.html]
- Structural preferences of the 167 more frequent loop clusters [http://bioserv.impmc.jussieu.fr/LCA-table.html]
- Rose GD: Prediction of chain turns in globular proteins on a hydrophobic basis. Nature 1978, 272: 586–590. 10.1038/272586a0
- Gromiha MM, Ponnuswamy PK: Prediction of protein secondary structures from their hydrophobic characteristics. Int J Pept Protein Res 1995, 45: 225–240.
- Palliser CC, Parry DA: Quantitative comparison of the ability of hydropathy scales to recognize surface beta-strands in proteins. Proteins 2001, 42: 243–255. 10.1002/1097-0134(20010201)42:2<243::AID-PROT120>3.0.CO;2-B
- Radhakrishnan I, Perez-Alvaredo GC, Parker D, Dyson HJ, Montminy MR, Wright PE: Solution structure of the KIX domain of CBP bound to the transactivation domain of CREB: a model for activator:coactivator interactions. Cell 1997, 91: 741–752. 10.1016/S0092-8674(00)80463-8
- Ferron F, Longhi S, Canard B, Karlin D: A practical overview of protein disorder prediction methods. Proteins 2006, 65: 1–14. 10.1002/prot.21075
- Coeytaux K, Poupon A: Prediction of unfolded segments in a protein sequence based on amino acid composition. Bioinformatics 2005, 21: 1891–1900. 10.1093/bioinformatics/bti266
- Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22: 2577–2637. 10.1002/bip.360221211
- Frishman D, Argos P: Knowledge-based protein secondary structure assignment. Proteins 1995, 23: 566–579. 10.1002/prot.340230412
- Labesse G, Colloc'h N, Pothier J, Mornon J-P: P-SEA proteins: a new efficient assignment of secondary structure from C-alpha trace. Comput Appl Biosci 1997, 13: 291–295.
- Dupuis F, Sadoc JF, Mornon JP: Protein secondary structure assignment through Voronoi tessellation. Proteins 2004, 55: 519–528. 10.1002/prot.10566
- Parisien M, Major F: A new catalog of protein beta-sheets. Proteins 2005, 61: 545–558. 10.1002/prot.20677
- Poupon A, Mornon JP: Populations of hydrophobic amino acids within protein globular domains: identification of conserved "topohydrophobic" positions. Proteins 1998, 33: 329–342. 10.1002/(SICI)1097-0134(19981115)33:3<329::AID-PROT3>3.0.CO;2-E
- Poupon A, Mornon JP: "Topohydrophobic positions" as key markers of globular protein folds. Theor Chem Accounts 1999, 101: 2–8.
- Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247: 536–540. 10.1006/jmbi.1995.0159
- Andreeva A, Howorth D, Brenner SE, Hubbard TJP, Chothia C, Murzin AG: SCOP database in 2004: refinements integrate structure and sequence family data. Nucl Acids Res 2004, 32: D226-D229. 10.1093/nar/gkh039
- RCSB FTP Archive [ftp://ftp.rcsb.org/pub/pdb/ls-lR]
- Wang G, Dunbrack RLJ: PISCES: a protein sequence culling server. Bioinformatics 2003, 19: 1589–1591. 10.1093/bioinformatics/btg224
- EVA measures for secondary structure prediction accuracy [http://cubic.bioc.columbia.edu/eva/doc/measure_sec.html]
- Callebaut I, Dulin F, Bertrand O, Ripoche P, Mouro I, Colin Y, Mornon JP, Cartron JP: Hydrophobic cluster analysis and modeling of the human Rh protein three-dimensional structures. Transfus Clin Biol 2006, 13: 70–84. 10.1016/j.tracli.2006.02.001
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.