
Spectrum of disease-causing mutations in protein secondary structures



Most genetic disorders are linked to missense mutations as even minor changes in the size or properties of an amino acid can alter or prevent the function of the protein. Further, the effect of a mutation is also dependent on the sequence and structure context of the alteration.


We investigated the spectrum of disease-causing missense mutations in secondary structure elements in proteins with numerous known mutations and for which an experimentally defined three-dimensional structure is available. We obtained a comprehensive map of the differences in mutation frequencies, location and contact energies, and the changes in residue volume and charge – both in the mutated (original) amino acids and in the mutant amino acids in the different secondary structure types. We collected information for 44 different proteins involved in a large number of diseases. The studied proteins contained a total of 2413 mutations of which 1935 (80%) appeared in secondary structures. Differences in mutation patterns between secondary structures and whole proteins were generally not statistically significant whereas within the secondary structural elements numerous highly significant features were observed.


Numerous trends in mutated and mutant amino acids are apparent. Among the original residues, arginine clearly has the highest relative mutability. The overall relative mutability among mutant residues is highest for cysteine and tryptophan. The mutability values are higher for mutated residues than for mutant residues. Arginine and glycine are among the most mutated residues in all secondary structures whereas the other amino acids have large variations in mutability between structure types. Statistical analysis was used to reveal trends in different secondary structural elements, residue types as well as for the charge and volume changes.


Now that the sequence of the human genome is almost complete, the research interest in genomics has moved from determining the sequence to the analysis of genetic variations, e.g. the Human Variome Project [1], and collecting data in locus-specific mutation databases [2, 3]. There is also a race to develop methods for cost-effective sequencing of the genomes of individuals [4]. Missense mutations in coding DNA, which lead to single amino acid changes in proteins, are commonly linked to human disorders [5]. The number of documented disease-linked missense and nonsense mutations is close to 30,000 [6]. A disease phenotype can arise because an amino acid change results in the loss of a critical protein function, in structural alterations, or because the mutation leads to "gain of function" effects such as functional dysregulation or the formation of toxic aggregates [7–9].

Only correctly folded proteins can deliver all the functional properties of a protein. Even minor changes in the size or properties of an amino acid side chain can alter or prevent the function of the protein. On the other hand, even large deletions or insertions may be tolerated in numerous positions within a protein [10]. The effects of mutations are also dependent on the protein sequence and structure context of the alteration. General statistical analyses have been performed for disease-causing mutations, for non-synonymous SNPs (nsSNPs) [9, 11–16], for groups of diseases, such as immunodeficiencies [3], and for groups of proteins, such as protein kinases [17]. Based on these studies and others, a number of methods have been developed for the prediction of tolerance and the consequences of mutations [13, 18–23].

Structural information is needed to fully understand the effects and consequences of mutations, whether disease-causing or used purposefully to modify the properties of a protein, e.g. in protein engineering. Three-dimensional structures and computer models have been used by us and others to elucidate disease mechanisms from amino acid substitutions, e.g. in references [24–26]. We have reviewed and discussed the applicability of more than 30 sequence and structure utilizing methods to predict the outcome of missense mutations to explain the basis of diseases [27]. Activity modifying mutations are also valuable for understanding the functions and conservation of amino acids and sequence regions in protein families. Recently, all the possible amino acid substitutions and their effects on biophysical properties were investigated in five proteins [28].

Secondary structural elements, α-helices, β-strands, turns and bends, are basic structural components of protein scaffolds. Amino acids are differently distributed between these elements. This information has been utilized for decades to predict the location of secondary structures from sequence information, e.g. in references [29–34]. Secondary structures are common regular conformations of polypeptides, and they are the most energetically favourable structures. Each secondary structure type has characteristic backbone φ and ψ angles. Secondary structures fold together in proteins and form higher order structures such as super secondary structures, motifs, domains and tertiary structures. The organization of the folds is very similar in related protein structures even though the sequence identities can be very small. Secondary structures are generally of substantial length and can pass through the hydrophobic core of globular proteins. Within protein families the secondary structures are more conserved than the surface loops connecting the adjacent elements.

Since secondary structures are structural building blocks that cover some 25 – 75% of the length of proteins, it would be interesting to know how disease-related, and thus function and/or structure altering, mutations affect these elements. We investigated the occurrence, location and distribution of disease-causing mutations in secondary structures. The study is based on statistics and bioinformatical analysis of three-dimensional structures and protein sequence information. Not many differences occur in mutation types between secondary structure elements and whole proteins. Clear differences were observed within the mutation spectra for different secondary structural elements and regions outside secondary structures. Some features, like the overrepresentation of arginines, were evident in all the secondary structures. Our analysis covers different amino acid substitutions, alterations of physicochemical properties, and sequence conservation. We investigated mutations both in the mutated original residues and in the mutant, altered, amino acids.

Results and discussion

Secondary structures have an energetically favourable organization of the polypeptide chain. Our aim was to obtain a comprehensive map of the differences in mutation frequencies, location, contact energies and changes in residue volume and charge, both in the mutated amino acids and in the mutant amino acids, for the different secondary structure types. We collected information for 44 proteins involved in a large number of diseases (Table 1). The criteria for choosing the proteins were a relatively large number of reported missense mutations and the availability of the three-dimensional structure. The genes in Table 1 are listed with the recommended HGNC names (HUGO Gene Nomenclature Committee) [35, 36]. The number of missense mutations varied from 8 to 240 per investigated protein or domain. The proteins represent different activities and functions including enzymes, signalling proteins, membrane proteins, receptors etc. Of the total of 46 structures, 42 had a resolution greater than 2.00 Å. There are more PDB entries than proteins because for the large BTK and VDR proteins there are two structures for different domains. The studied proteins contained altogether 2413 mutations of which 1935 (80%) appeared in secondary structures. The amino acid composition of all the proteins is shown in Figure 1. Considering the large size of the dataset and the diversity of protein types and functions, the statistically significant results reveal the true nature of disease-causing amino acid changes. In the χ²-test, results were considered significant with a P value < 0.05.
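The significance criterion can be illustrated with a minimal sketch. It assumes that the expected count for a residue type is the total number of mutations multiplied by the fraction of that residue type in the composition, and that each residue type is tested against all the others (one degree of freedom). Neither assumption is stated in this form in the text, and the function names and the numbers in the example are hypothetical.

```python
# Sketch of a per-residue chi-square test (assumed 2-category layout:
# "this residue type" versus "all other residue types", 1 degree of freedom).

# Critical chi-square value for P = 0.05 with 1 degree of freedom.
CRITICAL_VALUE_P05_DF1 = 3.841


def expected_count(total_mutations, residue_fraction):
    """Expected mutations if mutations simply followed the composition."""
    return total_mutations * residue_fraction


def chi_square_one_residue(observed, expected, total_mutations):
    """Chi-square statistic for one residue type against the rest."""
    rest_observed = total_mutations - observed
    rest_expected = total_mutations - expected
    return ((observed - expected) ** 2 / expected
            + (rest_observed - rest_expected) ** 2 / rest_expected)


# Hypothetical example: 1000 mutations, residue makes up 5% of the sequence,
# and 80 mutations were observed at that residue type.
total = 1000
exp = expected_count(total, 0.05)
chi2 = chi_square_one_residue(80, exp, total)
significant = chi2 > CRITICAL_VALUE_P05_DF1
print(round(chi2, 2), significant)
```

With these made-up numbers the residue type would count as significantly overrepresented; the values in the tables of this study may have been derived with a different test layout.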

Table 1 Summary of analysed proteins and diseases
Figure 1

Amino acid distribution and relative mutability. Top row. Amino acid distribution in the investigated proteins (left), overall relative mutability of mutated (middle) and mutant residues (right). The same information is on the second row only for α-helices, third row for β-strands, fourth row for turns and bends, and in the bottom row for structures outside secondary structural elements.

The total chain length of the 44 investigated proteins is 12540 amino acids. The secondary structure elements are distributed as follows: helical structures contain altogether 4567 amino acids, of which 4118 (~90%) are in α-helices, 449 (~10%) in 3₁₀-helices and just 2 in π-helices. There are 153 amino acids in β-bridges and 2530 in β-ladders, altogether 2683 (27%), and there are 1436 (15%) and 1190 (12%) residues in turns and bends, respectively. 18 proteins belong to the SCOP [37] class of all-α proteins, 12 proteins to all-β proteins, 22 proteins to α and β proteins (14 α/β and 8 α+β) and 2 to coiled coil proteins. The Btk PH domain belongs to all-β proteins and the kinase domain to α+β proteins. In VDR, the nuclear receptor ligand-binding domain belongs to all-α proteins and the glucocorticoid receptor-like (DNA-binding) domain to the SCOP class of small proteins. Altogether there were 1939 mutations in the secondary structures, 48% in helices, 28% in β-structures, and 24% in turns and bends. These numbers follow the distribution of the structure elements. The length of the secondary structures varied as follows: α-helices from 1 to 46 residues, the average length being 11.3 residues, and 3₁₀-helices from 2 to 9 residues, average 3.4. In β-strands the length varied from 1 to 22, average 5.28. The length of the structure in turns varied from 1 to 8 residues, average 2.1, and in bends from 1 to 6, average 1.6.
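The residue counts above can be checked with a few lines of arithmetic. The sketch below uses only the counts given in the text; the variable names are illustrative. It suggests that the percentages quoted for β-structures, turns and bends are shares of the residues located in secondary structures rather than of the total chain length.

```python
# Residue counts taken from the text.
total_residues = 12540
counts = {
    "helices": 4567,
    "beta": 153 + 2530,   # beta-bridges + beta-ladders
    "turns": 1436,
    "bends": 1190,
}

in_secondary = sum(counts.values())
outside = total_residues - in_secondary

# Share of each element among residues located in secondary structures.
share_of_secondary = {k: round(100 * v / in_secondary) for k, v in counts.items()}

# Share of each element among all residues in the investigated proteins.
share_of_all = {k: round(100 * v / total_residues) for k, v in counts.items()}
turns_and_bends_of_all = round(
    100 * (counts["turns"] + counts["bends"]) / total_residues)
outside_of_all = round(100 * outside / total_residues)

print(share_of_secondary)
print(share_of_all, turns_and_bends_of_all, outside_of_all)
```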

Overall amino acid mutational spectrum

Table 2 summarizes all the mutations in the studied proteins and Table 3 mutations in the secondary structures only. DSSP classifies secondary structures into three helix types, two extended β-strand types, turns and bends. The helices represent the classical α-helix, π-helix and 3₁₀-helix. The difference between the helix types originates from the number of residues per turn. In the normal right-handed α-helix there are 3.6 residues per turn, while π-helix has 4.4, and 3₁₀-helix only three. In helical structures the main chain residues form stabilizing hydrogen bonds with residues further in the sequence. All the side chains are pointing out from the helical core. A large proportion of α-helices are amphipathic, i.e. one face of the helix is hydrophilic and the other hydrophobic [38]. The 3₁₀-helix is the fourth most common type of secondary structure in proteins after α-helices, β-strands, and reverse turns [39]. 3₁₀-helices commonly appear as N- or C-terminal extensions of α-helices. They are typically only three residues long compared with a mean of 10–12 residues for α-helices [40]. β-Strands are divided into isolated β-bridges and extended strands. β-Strands can form structure stabilizing hydrogen bonds only with another strand in a parallel or antiparallel manner. Bends and turns are, according to DSSP terminology, relatively short structures that are located between the helical and strand elements. Well organized β- and γ-turns reverse the direction of the polypeptide chain. Only glycines are allowed in certain turns at certain positions due to steric restrictions in very tight bends.

Table 2 Spectrum of mutations in all residues in the studied proteins
Table 3 Spectrum of mutations appearing in α-helix, β-strand, turn and bend structures. Expected values are calculated from mutated and mutant amino acid composition in the studied proteins

First we investigated the statistical significance for mutated and mutant amino acids located in secondary structures compared to overall distribution. Then the patterns within each secondary structural element were revealed.

Arginine is clearly the most mutated residue type, as already indicated in previous studies [41–43]. The very high mutability of codons for arginine arises from CpG dinucleotides, which can spontaneously mutate by deamination either to TG or CA dinucleotides [44]. Arginine is coded by six codons, four of which have a CpG dinucleotide in the first and second codon position. In addition to the CpG dinucleotide we have also shown previously that the surrounding sequence context has an effect on mutability [42].
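The codon argument can be made concrete with a short sketch. It lists the six arginine codons of the standard genetic code (written as DNA), picks those with CpG in the first and second position, and shows the codons produced by the CG to TG and CG to CA changes. Only the codons needed here are translated, and the function name is illustrative.

```python
# The six arginine codons of the standard genetic code, as DNA.
ARG_CODONS = ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"]

# Partial codon table: only the products of CpG deamination in CGN codons.
CODON_TO_AA = {
    "TGT": "Cys", "TGC": "Cys", "TGA": "Stop", "TGG": "Trp",
    "CAT": "His", "CAC": "His", "CAA": "Gln", "CAG": "Gln",
}


def cpg_deamination_products(codon):
    """Return (codon, amino acid) pairs for CG->TG and CG->CA changes.

    Only a CpG in codon positions 1-2 is considered; codons without it
    give an empty list.
    """
    if not codon.startswith("CG"):
        return []
    tg_codon = "T" + codon[1:]            # C -> T in position 1
    ca_codon = codon[0] + "A" + codon[2]  # G -> A in position 2
    return [(tg_codon, CODON_TO_AA[tg_codon]),
            (ca_codon, CODON_TO_AA[ca_codon])]


cpg_codons = [c for c in ARG_CODONS if c.startswith("CG")]
for codon in cpg_codons:
    print(codon, cpg_deamination_products(codon))
```

The resulting amino acids are cysteine, tryptophan, histidine and glutamine, plus one stop codon (TGA from CGA), which is a nonsense rather than a missense change.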

We calculated the relative mutability of all the amino acids as both mutant and mutated residues (Fig. 1). The residue that had the lowest ratio of observed vs. expected mutations (marked '1') was used as a reference in these calculations. This residue is lysine for mutated and alanine for mutant residues.
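The normalisation described above can be sketched as follows. The function divides observed by expected counts for each residue type and then scales all ratios so that the residue with the lowest ratio becomes 1. The counts in the example are invented for illustration and are not taken from the tables of this study.

```python
def relative_mutability(observed, expected):
    """Scale observed/expected ratios so that the lowest ratio equals 1.

    observed, expected: dicts mapping residue type to mutation counts.
    Returns the reference residue and a dict of relative mutabilities.
    """
    ratios = {aa: observed[aa] / expected[aa] for aa in observed}
    reference = min(ratios, key=ratios.get)
    relative = {aa: ratio / ratios[reference] for aa, ratio in ratios.items()}
    return reference, relative


# Hypothetical counts for three residue types.
observed = {"R": 60, "G": 40, "K": 5}
expected = {"R": 10, "G": 10, "K": 10}

reference, relative = relative_mutability(observed, expected)
print(reference, relative)
```

In this toy example lysine has the lowest ratio and becomes the reference, so the other values read as multiples of the lysine ratio.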

Among the original residues, arginine has clearly the highest relative mutability. The picture is completely different for the mutant residues, where the overall relative mutability among mutant residues is highest for cysteine and tryptophan. Table 2 shows that C, D, H, K, P and W are highly mutated residues. In the second place after arginine in mutated residues is cysteine. Glycine and tryptophan are also highly significantly overrepresented among mutated residues. Proline is the only amino acid forming a ring with the backbone and it has very rigid structure, which bends the main chain of the protein in a characteristic way. Proline is a known breaker of secondary structures [30]. K, E, Q, F and T are significantly underrepresented among mutated residues.

Table 3 shows that the distribution of mutations in secondary structural elements is very close to that expected based on the mutation count in the whole proteins. Only S as a mutated and M as a mutant residue type are significantly underrepresented. Based on the amino acid distributions (Fig. 1), the composition of α-helices is close to the general distribution whereas β-structures and turns and bends have clearly different distributions. The amino acid compositions of the individual secondary structure types thus differ markedly from each other. Of all the residues in the investigated structures, α-helices account for 36%, β-structures for 21% and turns and bends for 21%; the remaining 21% are in areas not assigned to any secondary structure type. The secondary structures harbour 80% of all the disease-causing mutations. The amino acid distribution outside the regular secondary structural elements is very similar to that of the whole proteins except for a high overrepresentation of P. The trends for mutated and mutant residues are similar to those in whole proteins. R has very high relative mutability for original residues (Fig. 1).

The only remarkable difference between Tables 2 and 4 (data calculated based on the amino acid distribution in secondary structures) is methionine, which is highly overrepresented in Table 2 for all the mutations as a mutant residue, but has no statistical correlation in the secondary structures (Table 4). A, L and I appear much less frequently amongst the replacing residues than expected. Our results are in line with those published previously for general mutation distribution [11, 12].

Table 4 Spectrum of mutations appearing in α-helix, β-strand, turn and bend structures. Expected values are calculated from amino acid composition in secondary structural elements

Mutational spectrum within α-helices, β-strands and turns and bends

The amino acid composition in α-helices is very similar to overall amino acid composition (Fig. 1). The only main difference is in the ratio of glycines and prolines, which are clearly depleted in the helices, whereas glycines, as expected, are especially strongly overrepresented in turns and bends. The most prominent residues in β-strands are aliphatic isoleucine, leucine and valine followed by alanine, phenylalanine and threonine. Aspartic acid is surprisingly frequent in turns and bends.

In the analysis of helix mutations in all secondary structures [see Additional file 1], glycine is very significantly underrepresented as a mutated residue, and A and N also have statistically significant values. M is the only significant mutant residue type. Correspondingly, in β-strands V, I and S have significant χ² values as mutated residues and S, V and W as mutant residues [see Additional file 2]. In turns and bends the significantly mutated amino acids are S, V, A, D, I, K and L, and the significant mutant residues are E, M and R [see Additional file 3].

We further investigated the mutations within these secondary structures. The distribution of disease-causing mutations in helical structures in the investigated 44 structures is shown in Table 5. The number of mutations in helices is altogether 928 (48%), whereas β-strand structures of extended strands and isolated β-bridges contain 550 (28%) missense mutations (Table 6) and hydrogen bonded turns and bends contain 461 (24%) mutations (Table 7). The results for original and mutant residues are clearly and significantly very different for the different structural elements. In helices, besides arginine and cysteine other residues are not very highly mutated. If only the secondary structures are taken into account, glycine appears to be the most mutated residue after arginine. This result is similar to a Steward et al. study that revealed that glycine was the second most frequently mutating amino acid in the OMIM disease database [9].

Table 5 Spectrum of mutations in α-helices
Table 6 Spectrum of mutations in β-strands
Table 7 Spectrum of mutations in turns and bends

In helices, lysine is a strongly underrepresented residue compared with the calculated expected value. P, C, Q, W and D show statistically significant overrepresentation when compared to expected values, while S and L are underrepresented among mutant amino acids. Relative mutability in α-helices indicates that arginine is 7.29 times more mutated than expected whereas glycine is 4.88 times so.

In β-structures, R is statistically very significantly overrepresented and G and H have weaker enrichment. T is the statistically least mutated residue in β-strands. Among mutant residues, C is highly overrepresented and A underrepresented. Otherwise the results are less biased than in helices.

R is also statistically the most overrepresented residue in turns and bends although the χ² score is smaller than for the other secondary structures. G is the only other statistically highly significantly overrepresented amino acid. G is the most flexible residue because it does not have a side chain. It often appears in tight turns where no other residue can replace it. K is highly underrepresented among the mutated residues. Interestingly, R is the most enriched amino acid among mutant residues.

The distributions of amino acid frequencies and relative mutabilities in the different structural elements (Fig. 1) are clearly very different. First of all, the values are higher for mutated residues than for mutant residues, indicating that the original residue in many instances is very important and substitutions to any other residue are not possible without detrimental effects. Arginine and glycine are among the most mutated residues in all secondary structures whereas the other amino acids have large variations between the secondary structure types. The effect of arginine can also be seen in the data for mutant residues. Point mutations in arginine lead mostly to cysteine and glutamine, which have relatively high mutability values. The mutant residues in turns and bends have a more even distribution than the other two elements because practically any mutation to key residues in a tight hairpin turn leads either to the loss of hydrogen bonds or does not allow tight turn formation and leads to consequent structural alterations.

The data for mutations outside regular elements (Table 8) indicate that R, H and Y are significantly overrepresented and Q, K and E underrepresented among mutated residues, while N is the only residue type that is significantly overrepresented among mutant residues.

Table 8 Mutated and mutant residues not in secondary structures

Mutational spectrum in amino acid groups within structural elements

Amino acids can be grouped based on their physicochemical nature. The reason for performing an analysis with residue categories is that in many sites the specific amino acid is not very important but the suitable properties it provides are. This can be seen e.g. in multiple sequence alignments of protein families. We used the following six groups: hydrophobic (V, I, L, F, M, W, Y, C), positively charged (R, K, H), negatively charged (D and E), conformational (G and P), polar (N, Q, S) and a group consisting of A and T [45]. These groups follow the amino acid substitution matrices used for sequence alignments and database searches. The results for amino acid frequencies in the secondary structure types are in Tables 9, 10, 11. Different groups are over- or underrepresented in different structure types. Positively charged residues, due to arginine, are overrepresented in all three tables. Conformational residues are also overrepresented in all three secondary structure classes, but with varying χ²-values. Negatively charged amino acids are significantly underrepresented in helices and overrepresented in turns and bends.
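The six groups listed above can be written as a simple lookup table. The sketch below builds a residue-to-group mapping and counts residues per group; the group labels and the function name are illustrative, and the example residue list is made up.

```python
from collections import Counter

# The six physicochemical groups as listed in the text.
GROUPS = {
    "hydrophobic": "VILFMWYC",
    "positively charged": "RKH",
    "negatively charged": "DE",
    "conformational": "GP",
    "polar": "NQS",
    "A and T": "AT",
}

# Reverse lookup: one-letter residue code -> group label.
AA_TO_GROUP = {aa: group for group, residues in GROUPS.items() for aa in residues}


def count_by_group(residues):
    """Count how many residues in an iterable fall into each group."""
    return Counter(AA_TO_GROUP[aa] for aa in residues)


# Hypothetical list of mutated (original) residues.
example = ["R", "R", "G", "C", "D", "T"]
print(count_by_group(example))
```

The same function can be applied separately to the original and the replacing residues of each secondary structure type to obtain group-level observed counts.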

Table 9 Spectrum of mutations in α-helices in amino acid groups
Table 10 Spectrum of mutations in β-strands in amino acid groups
Table 11 Spectrum of mutations in turns and bends in amino acid groups

Mutations to polar residues are significantly depleted in α-helices and especially so in turns and bends. The number of significant observations is lower in mutant residues. Negatively charged amino acids are overrepresented in both β-structures and in bends and turns, although the enrichment is relatively weak. Alanine and threonine are underrepresented in both helices and β-strands. These results likely reflect the importance of the original residue. Many disease-causing mutations affect the same positions indicating that the site is structurally or functionally important, where any mutation disrupts the protein activity.

Changes in amino acid volume and charge due to mutations

Next we examined the differences in volume and charge between the original residue and the replacing mutant residue (Fig. 2). The replacing residues within α-helices and β-strand structures are generally physically smaller than the original residues while in turns and bends the volume is remarkably increased. In 3₁₀-helices the replacing residue is on average 13.5 Å³ smaller than the original residue and in α-helices almost 4 Å³ smaller. Mutations to residues in β-bridges increase the volume, while mutations in β-strands on average reduce the volume of the amino acid. The increase in volume in turns is 22.6 Å³ and in bends 15.8 Å³, thus on average the replacing residue is much larger than the original residue. In turns and bends, glycines form the biggest group among mutated native amino acids. Because glycine is the smallest amino acid, every substitution of G increases the residue volume. Differences at the amino acid level show that there are clear peaks in some residues, and that the changes in all secondary structure types are very similar (Fig. 3). Glycine (68 Å³) is replaced by residues that are on average 85 Å³ larger in turns and bends, 83 Å³ larger in α-helices and 78 Å³ larger in β-strands. The largest residues are W, Y, F and R. The largest amino acid, tryptophan (237 Å³), is replaced by residues whose volume is on average 100 Å³ smaller in β-strands and 94 Å³ and 93 Å³ smaller in α-helices and turns and bends, respectively. Being a bulky residue, tryptophan is very rare in turns and bends. In our dataset W accounted for 1% of all residues in turns and bends. There was just one mutated W in bends and three in turns.
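The average volume change can be sketched as the mean of mutant volume minus original volume over a set of substitutions. The volume table below contains only the two values quoted in the text (glycine and tryptophan); a full analysis would need volumes for all 20 residues from a published scale, which is not reproduced here. The function name and the example substitutions are hypothetical.

```python
# Residue volumes in cubic angstroms; only the two values given in the text.
VOLUME = {"G": 68.0, "W": 237.0}


def mean_volume_change(substitutions, volume):
    """Mean of (mutant volume - original volume) over substitutions.

    substitutions: list of (original, mutant) one-letter residue codes.
    A positive result means the replacing residues are larger on average.
    """
    changes = [volume[mutant] - volume[original]
               for original, mutant in substitutions]
    return sum(changes) / len(changes)


# Hypothetical substitutions: two G->W changes and one W->G change.
example = [("G", "W"), ("G", "W"), ("W", "G")]
print(mean_volume_change(example, VOLUME))
```

Grouping the substitutions by secondary structure type before calling the function would give per-element averages of the kind reported above.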

Figure 2

Changes to properties of amino acids caused by mutations. Average changes per residue in residue volumes when comparing original amino acid and mutated amino acid in A) α-helices, β-strands, turns and bends and B) separately in α-helices (H), 3₁₀-helices (G), extended strands (E), isolated β-bridges (B), turns (T) and bends (S). C) Average changes per residue in charges when comparing original amino acid and mutated amino acid in α-helices, β-strands, turns and bends and D) separately in α-helices, 3₁₀-helices, extended strands, isolated β-bridges, turns and bends.

Figure 3

Changes in mutations to residue volumes. Average changes in mutations to residue volumes in α-helices (red), β-strands (blue), turns and bends (green). The thick line indicates the original amino acid volumes. Outer rings indicate addition in volume and inner rings reduction, in steps of 35 Å³.

Average changes in isoelectric point (IEP) are similar for the main secondary structure types (Fig. 2C). In all the structures the IEP is increased on average compared to the normal structure. There are differences in the extent of the change; the largest changes are in turns and bends, on average close to 0.8 pI. The reason for the dramatic average increase in pI values is the introduction of prolines and lysines that have relatively high pI values. Arginine has the highest pI value of all amino acids, but in turns and bends there are almost equally high enrichments of arginines both in mutated and mutant residues, so arginines do not affect the IEP crucially. Figure 2D also shows in detail the changes in different helices and β-structures. The change is much larger in 3₁₀-helices than in α-helices. A similar situation is seen for bends, in which mutations lead to clearly larger changes. Differences at the amino acid level show that changes in charge in α-helices and β-strand structures are minor but in turns and bends there are clear peaks in some of the residues. C and E are replaced by residues with higher pI values and basic K and R by residues with lower pI values (Fig. 4).

Figure 4

Changes in mutations to residue charges. Average changes in mutations to residue charges in α-helices (red), β-strands (blue), turns and bends (green). The thick line indicates the original amino acid charges. Outer rings indicate higher charge values and inner rings lower, in steps of 1.25 pI.

The role of contact energies

We used RankViaContact [46] to calculate residue contact energies for all proteins in the dataset based on three-dimensional structures. The contact energies of the mutated residues range from a very strong -27.6 to 7.8. The mutations are ranked based on their calculated contact energies. A slight majority (55%) of the mutations have strong or very strong contact energies. Most residues with strong contact energies are important for the stability of the protein structure. Many of the disease-related mutations are thus located in crucial structural sites in which alterations are deleterious. In Figure 5A the count of contact energies of the mutated residues has been calculated in all proteins and separately for secondary structures. In β-strands the distribution of contact energies is even, but in the other structures the distribution is biased towards weak positive contact energies.

Figure 5

Distribution of contact energies for mutated amino acids. The energies were calculated with the RankViaContact program. A) Data for all mutations, and B) distribution in secondary structural elements α-helices (red), β-strands (blue), turns and bends (green), outside secondary structures (yellow), and whole proteins (black).

Figure 5B shows the contact energy distribution for mutations in proteins and secondary structure types as well as outside the secondary structural elements. The curves have quite similar overall shapes although the location of the peak for the maximal occurrence varies. Of note is also that the mutation positions in turns and bends and outside secondary structures do not appear in sites of very strong contacts. Mutation sites in β-structures have very even distribution throughout the contact energy range.

We organized the residues into six groups based on their physicochemical properties and calculated the percentages of strong and weak contact energies in the groups (Fig. 6). Among hydrophobic residues more than 90% of the original amino acids have strong contact energies. Polar, conformational, positively and negatively charged mutated residues have for the most part weak contact energies (70–85%). A and T residues have mainly strong contact energies. These results indicate the importance of the residue type for interactions, whether hydrophobic, van der Waals or electrostatic. Charged residues often form salt bridges, while hydrophobic and aliphatic amino acids are involved in weaker interactions. A large proportion of the mutated residues form interactions which are essential for the protein and its structure. Structural alterations are the most common consequence of disease-causing mutations [18].

Figure 6

Mutant residues with strong and weak contact energies. The percentage of mutant residues with strong (black) and weak (white) contact energies in physicochemical amino acid groups.


Germline substitutions leading to the replacement of a native residue can result in either benign effects (e.g. polymorphisms) or in genetic disease. It has been shown that most positions in proteins can be altered without serious effects on protein structure or function [47]. On the other hand, the majority of disease-causing mutations have structural effects [14]. Therefore, mutations that give rise to a disease phenotype indicate the importance of the specific location. Knowledge of the molecular basis of mutations is important and can be used in several ways.

Mutation spectra are clearly different for the different structural elements. The most prominent feature for all the elements is the strong overrepresentation of arginine as mutated residue. Large changes in the properties of mutant amino acids, in volume or charge, are disease-related, but there are large differences for the structural elements. Contact energy distributions of mutated residues are surprisingly similar except for β structures. About half of the mutation sites are involved in strong or very strong amino acid interactions. In conclusion, there are many and strong trends in mutant and mutated residues. The trends are statistically significantly different for the different secondary structures.

Previously, mutation statistics have been studied at a general level [8, 12, 48] as well as a structural level [9, 12], but detailed analysis of the spectrum and effects of mutations within secondary structural elements has been missing. In the study of Ferrer-Costa et al. [12] the dataset consisted of 1169 disease-associated single amino acid polymorphisms (daSAPs) distributed over 73 proteins with structure information from the PDB. The study combines information from all secondary structure elements without element specific data. In the study of Steward and co-workers [9] 63 proteins contained 1292 disease-associated sites. They did not analyse secondary structures separately. Our study involved 1939 missense mutations in secondary structures and 2411 mutations altogether. We localized the largest share of missense mutations to α-helices. In the dataset of Ferrer-Costa et al., the biggest group of daSAPs is found in coils, and helices contain 36.7% of all daSAPs. Ferrer-Costa et al. also included volume comparisons in their study. In contrast to our work, the size changes to daSAPs were calculated for the whole dataset. We calculated the changes for each secondary structure type separately, also at amino acid level. Arginine is the most commonly mutated residue, in agreement with the result of Steward et al. [9]. Wang and Moult [18] examined 262 missense mutations in 23 proteins and concluded that 80% of mutations destabilize the protein structure relative to the folded state based on changes in hydrophobic burial, backbone strain, overpacking, and electrostatic interactions. In several other studies, structural properties have been combined with a sequence profile for the mutated position [21, 23, 49], which reflects the occurrence of other amino acid types at corresponding sites in homologous proteins.

Vitkup et al. [11] investigated a total of 4236 mutations from 436 genes and concluded that mutations at arginine and glycine residues are together responsible for about 30% of genetic diseases. This result is similar to ours. In our dataset, 25% of all missense mutations occur in R and G, and 23% are present in the examined secondary structures. Vitkup et al. also found that random mutations at tryptophan and cysteine have the highest probability of causing disease, which is in line with our results. The overall relative mutability to mutant residues is 3.46 for C and 3.31 for W, compared to alanine, which has the lowest ratio of observed vs. expected mutations and is set to 1.

To correlate the mutated and mutant types to protein structures, it is important to know in which secondary structural element the alteration occurs. Protein function and interactions require both stability and specificity. Proteins fold according to the minimum free energy. In contrast, they organize themselves to recognize a transition state or a ligand [50]. The frequencies of amino acids and of mutations differ markedly between the secondary structural elements. Multiple sequence alignments can be used to investigate allowed substitutions in protein families. Our analysis revealed mutation types which are most likely deleterious. This information could also be used for the development and optimization of amino acid comparison tables for individual secondary structural elements. Our data could also add to the reliability of the predictions of mutation effects.


We investigated proteins for which numerous disease-causing missense mutations were available along with an experimentally defined three-dimensional structure. Mutation data were collected from BTKbase [51–53] for Btk mutations, CD40Lbase [54–56] for CD40L mutations and from the Human Gene Mutation Database (HGMD) [6] for the remainder. The publicly available mutations are in supplementary table 4 [see Additional file 4]; the full list is available from the authors on request. The DSSP program [57] was used to identify secondary structures for three helix types, two extended β-strand types, turns and bends, based on the stereochemistry of the protein structure. The DSSP program was used to systematically assign secondary structures for each residue in the three-dimensional structures obtained from the Protein Data Bank (PDB) [58]. A Perl script was written to connect mutation information to protein structure data (dssp files).
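The mapping step described above can be sketched as follows. This is a minimal Python illustration, not the authors' Perl script. It assumes the standard DSSP one-letter alphabet (H, G, I for helices; E, B for β structures; T, S for turns and bends), and the function names, the input layout and the example mutation are hypothetical.

```python
# Grouping of DSSP one-letter codes into the three classes analysed in the text.
# Assumption: standard DSSP alphabet; any other code (including blank) is "other".
DSSP_CLASS = {
    "H": "helix", "G": "helix", "I": "helix",   # three helix types
    "E": "strand", "B": "strand",               # two extended beta types
    "T": "turn_bend", "S": "turn_bend",         # turns and bends
}


def classify(dssp_code):
    """Return the secondary structure class for a DSSP code."""
    return DSSP_CLASS.get(dssp_code, "other")


def map_mutations(mutations, dssp_by_residue):
    """Attach a secondary structure class to each mutation.

    mutations: list of (residue_number, original_aa, mutant_aa) tuples.
    dssp_by_residue: dict mapping residue_number to its DSSP code.
    """
    mapped = []
    for resnum, original, mutant in mutations:
        code = dssp_by_residue.get(resnum, " ")  # unresolved residues count as "other"
        mapped.append((resnum, original, mutant, classify(code)))
    return mapped


# Hypothetical example: one mutation at residue 42, assigned "E" by DSSP.
example = map_mutations([(42, "R", "H")], {42: "E"})
```

Residues missing from the structure fall into the "other" class here; whether such mutations were excluded or handled differently in the original analysis is not stated in the text.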

If more than one structure existed for a protein or a protein domain, the one with the highest resolution and the longest chain length was chosen. All data were stored in a MySQL database. Altogether we investigated 44 proteins with a total of 12540 residues, of which 9878 were in the examined secondary structures. The proteins contained 2411 different disease-causing mutations, 1939 present in secondary structures.

The RankViaContact service was used to calculate residue-residue contact energies based on a coarse-grained model [46]. The energy parameters used for residue-residue contacts [59] were derived considering the secondary structural environments. The contact energies were estimated for all the missense mutations located in the secondary structures.

Mutation statistics were analysed by comparing the frequencies of the obtained mutations with the expected values. Expected values for mutated residues within α-helices, β-strands and turns and bends were calculated using the distribution of all amino acids in respective secondary structure. In the case of the mutant amino acids within different secondary structures, the expected values were calculated from codon diversity by taking into account all possible amino acid substitutions. In order to reveal how the mutation distribution in secondary structures compares to the overall distribution of mutations in the dataset, the expectation values for mutated and mutant residues were calculated based on all the mutations in secondary structures.
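As an illustration of how expected values for mutant residues can be derived from codon diversity, the sketch below enumerates every single-nucleotide substitution in the 61 sense codons of the standard genetic code and counts the resulting missense changes. It assumes equal codon usage and equal substitution rates, which the text does not specify, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; "*" marks a stop codon.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def missense_counts():
    """Count single-nucleotide substitutions leading from one amino acid to another."""
    counts = Counter()
    for codon, original in CODON_TABLE.items():
        if original == "*":
            continue  # start only from sense codons
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutated = codon[:pos] + base + codon[pos + 1:]
                mutant = CODON_TABLE[mutated]
                # Keep missense changes only: skip synonymous and nonsense substitutions.
                if mutant != "*" and mutant != original:
                    counts[(original, mutant)] += 1
    return counts


def expected_mutant_fractions():
    """Fraction of all missense substitutions that yield each mutant amino acid."""
    per_mutant = Counter()
    for (_, mutant), n in missense_counts().items():
        per_mutant[mutant] += n
    total = sum(per_mutant.values())
    return {aa: n / total for aa, n in per_mutant.items()}
```

Multiplying these fractions by the number of observed mutations in a secondary structure type gives expected counts under this simplified model. Weighting by the codon or amino acid composition of the studied proteins would be a natural refinement.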

The χ2 test was used to determine the significance of the results. Chi square values were calculated using the following formula:

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e},$$

where $f_o$ is the observed frequency and $f_e$ is the expected frequency for an amino acid. P-values and 95% confidence intervals were estimated in one-tailed fashion. Relative mutability was calculated using the formula:

$$R_m(N) = \frac{N_{obs} \cdot N'_{exp}}{N'_{obs} \cdot N_{exp}},$$

where N' is the least mutated residue type, identified as the type with the lowest ratio between observed and expected values. N represents the number of mutated (original) or mutant residues for an amino acid type, with the subscripts obs and exp marking observed and expected values. The relative mutability was calculated for each residue type in all the investigated secondary structural elements.
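The χ² statistic and the relative mutability defined above can be computed as in the following minimal Python sketch. The function names and the example counts are hypothetical and serve only to illustrate the arithmetic.

```python
def chi_square(observed, expected):
    """Sum of (f_o - f_e)^2 / f_e over paired observed and expected frequencies."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def relative_mutability(observed, expected):
    """Observed/expected ratio per residue type, scaled so the least mutated type is 1.

    observed, expected: dicts keyed by amino acid. Types with no observed
    mutations are ignored when choosing the baseline (an assumption made here
    to avoid division by zero).
    """
    ratios = {aa: observed[aa] / expected[aa] for aa in observed}
    positive = [r for r in ratios.values() if r > 0]
    if not positive:
        raise ValueError("no residue type has observed mutations")
    baseline = min(positive)
    return {aa: r / baseline for aa, r in ratios.items()}


# Hypothetical counts for three residue types.
obs = {"R": 30, "A": 5, "G": 20}
exp = {"R": 10, "A": 10, "G": 10}
mutability = relative_mutability(obs, exp)  # {"R": 6.0, "A": 1.0, "G": 4.0}
```

A P-value for the χ² statistic can then be obtained from the survival function of the χ² distribution with the appropriate degrees of freedom, for example with `scipy.stats.chi2.sf`.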

To calculate changes in the isoelectric point (IEP), we used the EMBOSS iep program [60] to predict IEPs for native and mutant proteins. The average change in IEP in each secondary structure type was calculated. The changes in residue volumes in different secondary structure types were obtained by using residue volumes [61]. The results were weighted by the number of individual mutations and averaged by the number of mutant residues in the respective secondary structures.
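One possible reading of the weighting described above is a count-weighted mean of the volume difference between mutant and original residues. The sketch below follows that reading; the function name and input layout are hypothetical, and the residue volume table must be supplied by the caller (for instance from the values in [61], which are not reproduced here).

```python
def mean_volume_change(mutation_counts, volumes):
    """Count-weighted mean of (mutant volume - original volume).

    mutation_counts: dict mapping (original_aa, mutant_aa) to the number of
        mutations of that type in one secondary structure class.
    volumes: dict mapping amino acid to residue volume, supplied by the caller.
    """
    total = sum(mutation_counts.values())
    if total == 0:
        raise ValueError("no mutations in this secondary structure class")
    weighted = sum(
        n * (volumes[mutant] - volumes[original])
        for (original, mutant), n in mutation_counts.items()
    )
    return weighted / total
```

A positive result indicates that, on average, mutant residues are larger than the residues they replace in that secondary structure class.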


  1. Cotton RG, Kazazian HH Jr.: Toward a Human Variome Project. Hum Mutat 2005, 26: 499. 10.1002/humu.20272
  2. Horaitis O, Cotton RG: The challenge of documenting mutation across the genome: the human genome variation society approach. Hum Mutat 2004, 23: 447–452. 10.1002/humu.20038
  3. Piirilä H, Väliaho J, Vihinen M: Immunodeficiency mutation databases (IDbases). Hum Mutat 2006, in press.
  4. Service RF: Gene sequencing. The race for the $1000 genome. Science 2006, 311: 1544–1546. 10.1126/science.311.5767.1544
  5. Botstein D, Risch N: Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet 2003, 33 Suppl: 228–237. 10.1038/ng1090
  6. Stenson PD, Ball EV, Mort M, Phillips AD, Shiel JA, Thomas NS, Abeysinghe S, Krawczak M, Cooper DN: Human Gene Mutation Database (HGMD): 2003 update. Hum Mutat 2003, 21: 577–581. 10.1002/humu.10212
  7. Dobson CM: Protein folding and misfolding. Nature 2003, 426: 884–890. 10.1038/nature02261
  8. Sanders CR, Myers JK: Disease-related misassembly of membrane proteins. Annu Rev Biophys Biomol Struct 2004, 33: 25–51. 10.1146/annurev.biophys.33.110502.140348
  9. Steward RE, MacArthur MW, Laskowski RA, Thornton JM: Molecular basis of inherited diseases: a structural perspective. Trends Genet 2003, 19: 505–513. 10.1016/S0168-9525(03)00195-1
  10. Poussu E, Vihinen M, Paulin L, Savilahti H: Probing the α-complementing domain of E. coli β-galactosidase with use of an insertional pentapeptide mutagenesis strategy based on Mu in vitro DNA transposition. Proteins 2004, 54: 681–692. 10.1002/prot.10467
  11. Vitkup D, Sander C, Church GM: The amino-acid mutational spectrum of human genetic disease. Genome Biol 2003, 4: R72. 10.1186/gb-2003-4-11-r72
  12. Ferrer-Costa C, Orozco M, de la Cruz X: Characterization of disease-associated single amino acid polymorphisms in terms of sequence and structure properties. J Mol Biol 2002, 315: 771–786. 10.1006/jmbi.2001.5255
  13. Ramensky V, Bork P, Sunyaev S: Human non-synonymous SNPs: server and survey. Nucleic Acids Res 2002, 30: 3894–3900. 10.1093/nar/gkf493
  14. Yue P, Li Z, Moult J: Loss of protein structure stability as a major causative factor in monogenic disease. J Mol Biol 2005, 353: 459–473. 10.1016/j.jmb.2005.08.020
  15. Sunyaev S, Hanke J, Aydin A, Wirkner U, Zastrow I, Reich J, Bork P: Prediction of nonsynonymous single nucleotide polymorphisms in human disease-associated genes. J Mol Med 1999, 77: 754–760. 10.1007/s001099900059
  16. Sunyaev S, Lathe W 3rd, Bork P: Integration of genome data and protein structures: prediction of protein folds, protein interactions and "molecular phenotypes" of single nucleotide polymorphisms. Curr Opin Struct Biol 2001, 11: 125–130. 10.1016/S0959-440X(00)00175-5
  17. Ortutay C, Väliaho J, Stenberg K, Vihinen M: KinMutBase: a registry of disease-causing mutations in protein kinase domains. Hum Mutat 2005, 25: 435–442. 10.1002/humu.20166
  18. Wang Z, Moult J: SNPs, protein structure, and disease. Hum Mutat 2001, 17: 263–270. 10.1002/humu.22
  19. Ng PC, Henikoff S: Predicting deleterious amino acid substitutions. Genome Res 2001, 11: 863–874. 10.1101/gr.176601
  20. Chen H, Zhou HX: Prediction of solvent accessibility and sites of deleterious mutations from protein sequence. Nucleic Acids Res 2005, 33: 3193–3199. 10.1093/nar/gki633
  21. Chasman D, Adams RM: Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation. J Mol Biol 2001, 307: 683–706. 10.1006/jmbi.2001.4510
  22. Bao L, Cui Y: Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information. Bioinformatics 2005, 21: 2185–2190. 10.1093/bioinformatics/bti365
  23. Saunders CT, Baker D: Evaluation of structural and evolutionary contributions to deleterious mutation prediction. J Mol Biol 2002, 322: 891–901. 10.1016/S0022-2836(02)00813-6
  24. Vihinen M, Vetrie D, Maniar HS, Ochs HD, Zhu Q, Vorechovský I, Webster AD, Notarangelo LD, Nilsson L, Sowadski JM, Smith CIE: Structural basis for chromosome X-linked agammaglobulinemia: a tyrosine kinase disease. Proc Natl Acad Sci U S A 1994, 91: 12803–12807. 10.1073/pnas.91.26.12803
  25. Rong SB, Vihinen M: Structural basis of Wiskott-Aldrich syndrome causing mutations in the WH1 domain. J Mol Med 2000, 78: 530–537. 10.1007/s001090000136
  26. Lappalainen I, Vihinen M: Structural basis of ICF-causing mutations in the methyltransferase domain of DNMT3B. Protein Eng 2002, 15: 1005–1014. 10.1093/protein/15.12.1005
  27. Thusberg J, Vihinen M: Bioinformatic analysis of protein structure-function relationships: case study of leukocyte elastase (ELA2) missense mutations. Hum Mutat 2006, 27: 1230–1243. 10.1002/humu.20407
  28. Terp BN, Cooper DN, Christensen IT, Jorgensen FS, Bross P, Gregersen N, Krawczak M: Assessing the relative importance of the biophysical properties of amino acid substitutions associated with human genetic disease. Hum Mutat 2002, 20: 98–109. 10.1002/humu.10095
  29. Gascuel O, Golmard JL: A simple method for predicting the secondary structure of globular proteins: implications and accuracy. Comput Appl Biosci 1988, 4: 357–365.
  30. Chou PY, Fasman GD: Prediction of protein conformation. Biochemistry 1974, 13: 222–245. 10.1021/bi00699a002
  31. Rost B, Sander C: Secondary structure prediction of all-helical proteins in two states. Protein Eng 1993, 6: 831–836. 10.1093/protein/6.8.831
  32. Robson B, Pain RH: Analysis of the code relating sequence to conformation in proteins: possible implications for the mechanism of formation of helical regions. J Mol Biol 1971, 58: 237–259. 10.1016/0022-2836(71)90243-9
  33. Robson B, Suzuki E: Conformational properties of amino acid residues in globular proteins. J Mol Biol 1976, 107: 327–356. 10.1016/S0022-2836(76)80008-3
  34. Garnier J, Osguthorpe DJ, Robson B: Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J Mol Biol 1978, 120: 97–120. 10.1016/0022-2836(78)90297-8
  35. Wain HM, Lush MJ, Ducluzeau F, Khodiyar VK, Povey S: Genew: the Human Gene Nomenclature Database, 2004 updates. Nucleic Acids Res 2004, 32: D255–7. 10.1093/nar/gkh072
  36. Wain HM, Bruford EA, Lovering RC, Lush MJ, Wright MW, Povey S: Guidelines for human gene nomenclature. Genomics 2002, 79: 464–470. 10.1006/geno.2002.6748
  37. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247: 536–540. 10.1006/jmbi.1995.0159
  38. Cornette JL, Cease KB, Margalit H, Spouge JL, Berzofsky JA, DeLisi C: Hydrophobicity scales and computational techniques for detecting amphipathic structures in proteins. J Mol Biol 1987, 195: 659–685. 10.1016/0022-2836(87)90189-6
  39. Barlow DJ, Thornton JM: Helix geometry in proteins. J Mol Biol 1988, 201: 601–619. 10.1016/0022-2836(88)90641-9
  40. Richardson JS, Richardson DC: Amino acid preferences for specific locations at the ends of α-helices. Science 1988, 240: 1648–1652. 10.1126/science.3381086
  41. Cooper DN, Youssoufian H: The CpG dinucleotide and human genetic disease. Hum Genet 1988, 78: 151–155. 10.1007/BF00278187
  42. Ollila J, Lappalainen I, Vihinen M: Sequence specificity in CpG mutation hotspots. FEBS Lett 1996, 396: 119–122. 10.1016/0014-5793(96)01075-7
  43. Krawczak M, Ball EV, Cooper DN: Neighboring-nucleotide effects on the rates of germ-line single-base-pair substitution in human genes. Am J Hum Genet 1998, 63: 474–488. 10.1086/301965
  44. Coulondre C, Miller JH, Farabaugh PJ, Gilbert W: Molecular basis of base substitution hotspots in Escherichia coli. Nature 1978, 274: 775–780. 10.1038/274775a0
  45. Shen B, Vihinen M: Conservation and covariance in PH domain sequences: physicochemical profile and information theoretical analysis of XLA-causing mutations in the Btk PH domain. Protein Eng Des Sel 2004, 17: 267–276. 10.1093/protein/gzh030
  46. Shen B, Vihinen M: RankViaContact: ranking and visualization of amino acid contacts. Bioinformatics 2003, 19: 2161–2162. 10.1093/bioinformatics/btg293
  47. Frillingos S, Sahin-Toth M, Wu J, Kaback HR: Cys-scanning mutagenesis: a novel approach to structure function relationships in polytopic membrane proteins. FASEB J 1998, 12: 1281–1299.
  48. Partridge AW, Therien AG, Deber CM: Missense mutations in transmembrane domains of proteins: phenotypic propensity of polar residues for human disease. Proteins 2004, 54: 648–656. 10.1002/prot.10611
  49. Sunyaev S, Ramensky V, Koch I, Lathe W 3rd, Kondrashov AS, Bork P: Prediction of deleterious human alleles. Hum Mol Genet 2001, 10: 591–597. 10.1093/hmg/10.6.591
  50. Pauling L, Campbell DH, Pressman D: The nature of the forces between antigen and antibody and of the precipitation reaction. Physiol Rev 1943, 23: 203–219.
  51. Väliaho J, Smith CIE, Vihinen M: BTKbase: the mutation database for X-linked agammaglobulinemia. Hum Mutat 2006, 27: 1209–1217. 10.1002/humu.20410
  52. Vihinen M, Cooper MD, de Saint Basile G, Fischer A, Good RA, Hendriks RW, Kinnon C, Kwan SP, Litman GW, Notarangelo LD, Ochs HD, Rosen FS, Vetrie D, Webster ADB, Zegers BJM, Smith CIE: BTKbase: a database of XLA-causing mutations. International Study Group. Immunol Today 1995, 16: 460–465. 10.1016/0167-5699(95)80027-1
  53.
  54. Notarangelo LD, Peitsch MC, Abrahamsen TG, Bachelot C, Bordigoni P, Cant AJ, Chapel H, Clementi M, Deacock S, de Saint Basile G, Duse M, Espanol T, Etzioni A, Fasth A, Fischer A, Giliani S, Gomez L, Hammarström L, Jones A, Kanariou M, Kinnon C, Klemola T, Kroczek RA, Levy J, Matamoros N, Monafo V, Paolucci P, Reznick I, Sanal O, Smith CIE, Thompson RA, Tovo P, Villa A, Vihinen M, Vossen J, Zegers BJM, Ochs HD, Conley ME, Iseki M, Ramesh N, Shimadzu M, Saiki O: CD40Lbase: a database of CD40L gene mutations causing X-linked hyper-IgM syndrome. Immunol Today 1996, 17: 511–516. 10.1016/0167-5699(96)30059-5
  55. Thusberg J, Vihinen M: The structural basis of hyper IgM deficiency - CD40L mutations. Protein Eng Des Sel 2007, 20: 133–141. 10.1093/protein/gzm004
  56.
  57. Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22: 2577–2637. 10.1002/bip.360221211
  58. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28: 235–242. 10.1093/nar/28.1.235
  59. Zhang C, Kim SH: Environment-dependent residue contact energies for proteins. Proc Natl Acad Sci U S A 2000, 97: 2550–2555. 10.1073/pnas.040573597
  60. Emboss iep program
  61. Pontius J, Richelle J, Wodak SJ: Deviations from standard atomic volumes as a quality measure for protein crystal structures. J Mol Biol 1996, 264: 121–136. 10.1006/jmbi.1996.0628


Financial support from the Finnish Academy and the Medical Research Fund of Tampere University Hospital is gratefully acknowledged.

Author information



Corresponding author

Correspondence to Mauno Vihinen.

Additional information

Authors' contributions

SK and MV designed the study together. SK collected data, formed the database and performed the statistical analysis. All authors read and approved the final manuscript.

Electronic supplementary material


Additional file 1: Spectrum of mutations appearing in α-helix. Expected values are calculated from mutated and mutant amino acid composition in the studied proteins. (DOC 58 KB)


Additional file 2: Spectrum of mutations appearing in β-strand. Expected values are calculated from mutated and mutant amino acid composition in the studied proteins. (DOC 57 KB)


Additional file 3: Spectrum of mutations appearing in turn and bend structures. Expected values are calculated from mutated and mutant amino acid composition in the studied proteins. (DOC 57 KB)


Additional file 4: The publicly available mutations. List of analysed mutations obtained from publicly available databases, BTKbase, CD40Lbase and SwissProt (DOC 2 MB)

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Khan, S., Vihinen, M. Spectrum of disease-causing mutations in protein secondary structures. BMC Struct Biol 7, 56 (2007).


  • Secondary Structure
  • Missense Mutation
  • Protein Data Bank
  • Secondary Structural Element
  • Mutant Amino Acid