Spectrum of disease-causing mutations in protein secondary structures

Background Most genetic disorders are linked to missense mutations as even minor changes in the size or properties of an amino acid can alter or prevent the function of the protein. Further, the effect of a mutation is also dependent on the sequence and structure context of the alteration. Results We investigated the spectrum of disease-causing missense mutations in secondary structure elements in proteins with numerous known mutations and for which an experimentally defined three-dimensional structure is available. We obtained a comprehensive map of the differences in mutation frequencies, location and contact energies, and the changes in residue volume and charge – both in the mutated (original) amino acids and in the mutant amino acids in the different secondary structure types. We collected information for 44 different proteins involved in a large number of diseases. The studied proteins contained a total of 2413 mutations of which 1935 (80%) appeared in secondary structures. Differences in mutation patterns between secondary structures and whole proteins were generally not statistically significant whereas within the secondary structural elements numerous highly significant features were observed. Conclusion Numerous trends in mutated and mutant amino acids are apparent. Among the original residues, arginine clearly has the highest relative mutability. The overall relative mutability among mutant residues is highest for cysteine and tryptophan. The mutability values are higher for mutated residues than for mutant residues. Arginine and glycine are among the most mutated residues in all secondary structures whereas the other amino acids have large variations in mutability between structure types. Statistical analysis was used to reveal trends in different secondary structural elements, residue types as well as for the charge and volume changes.


Background
Now that the sequence of the human genome is almost complete, the research interest in genomics has moved from determining the sequence to the analysis of genetic variations, e.g. the Human Variome Project [1], and collecting data in locus-specific mutation databases [2,3]. There is also a race to develop methods for cost-effective sequencing of the genomes of individuals [4]. Missense mutations in coding DNA, which lead to single amino acid changes in proteins, are commonly linked to human disorders [5]. The number of documented disease-linked missense and nonsense mutations is close to 30,000 [6]. A disease phenotype can arise because an amino acid change results in the loss of a critical protein function, in structural alterations, or because the mutation leads to "gain of function" effects such as functional dysregulation or the formation of toxic aggregates [7][8][9].
Only correctly folded proteins can deliver all the functional properties of a protein. Even minor changes in the size or properties of an amino acid side chain can alter or prevent the function of the protein. On the other hand, even large deletions or insertions may be tolerated in numerous positions within a protein [10]. The effects of mutations are also dependent on the protein sequence and structure context of the alteration. General statistical analyses have been performed for disease-causing mutations, for non-synonymous SNPs (nsSNPs) [9,11-16], for groups of diseases, such as immunodeficiencies [3], and for groups of proteins, such as protein kinases [17]. Based on these studies and others, a number of methods have been developed for the prediction of tolerance and the consequences of mutations [13,18-23].
Structural information is needed to fully understand the effects and consequences of mutations, whether disease-causing or introduced purposefully to modify the properties of a protein, e.g. in protein engineering. Three-dimensional structures and computer models have been used by us and others to elucidate disease mechanisms from amino acid substitutions, e.g. in references [24][25][26]. We have reviewed and discussed the applicability of more than 30 sequence- and structure-based methods for predicting the outcome of missense mutations and explaining the basis of diseases [27]. Activity-modifying mutations are also valuable for understanding the functions and conservation of amino acids and sequence regions in protein families. Recently, all the possible amino acid substitutions and their effects on biophysical properties were investigated in five proteins [28].
Secondary structural elements, α-helices, β-strands, turns and bends, are basic structural components of protein scaffolds. Amino acids are differently distributed between these elements. This information has been utilized for decades to predict the location of secondary structures from sequence information e.g. in references [29][30][31][32][33][34]. Secondary structures are common regular conformations of polypeptides, and they are the most energetically favourable structures. Each secondary structure type has characteristic backbone φ and ψ angles. Secondary structures fold together in proteins and form higher order structures such as super secondary structures, motifs, domains and tertiary structures. The organization of the folds is very similar in related protein structures even though the sequence identities can be very small. Secondary structures are generally of substantial length and can pass through the hydrophobic core of globular proteins. Within protein families the secondary structures are more conserved than the surface loops connecting the adjacent elements.
Since secondary structures are structural building blocks that cover some 25-75% of the length of proteins, it would be interesting to know how disease-related, and thus function and/or structure altering, mutations affect these elements. We investigated the occurrence, location and distribution of disease-causing mutations in secondary structures. The study is based on statistics and bioinformatic analysis of three-dimensional structures and protein sequence information. Few differences occur in mutation types between secondary structure elements and whole proteins. Clear differences were observed within the mutation spectra for different secondary structural elements and regions outside secondary structures. Some features, like the overrepresentation of arginines, were evident in all the secondary structures. Our analysis covers different amino acid substitutions, alterations of physicochemical properties, and sequence conservation. We investigated mutations both in the mutated original residues and in the mutant, altered, amino acids.

Results and discussion
Secondary structures have an energetically favourable organization of the polypeptide chain. Our aim was to obtain a comprehensive map of the differences in mutation frequencies, location, contact energies and changes in residue volume and charge, both in the mutated amino acids and in the mutant amino acids, for the different secondary structure types. We collected information for 44 proteins involved in a large number of diseases (Table 1). The criteria for choosing the proteins were a relatively large number of reported missense mutations and the availability of the three-dimensional structure. The genes in Table 1 are listed with the recommended HGNC names (HUGO Gene Nomenclature Committee) [35,36]. The number of missense mutations varied from 8 to 240 per investigated protein or domain. The proteins represent different activities and functions including enzymes, signalling proteins, membrane proteins and receptors. Of the total of 46 structures, 42 had a resolution greater than 2.00 Å. There are more PDB entries than proteins because for the large BTK and VDR proteins there are two structures for different domains. The studied proteins contained altogether 2413 mutations of which 1935 (80%) appeared in secondary structures. The amino acid composition of all the proteins is shown in Figure 1. Considering the large size of the dataset and the diversity of protein types and functions, the statistically significant results reveal the true nature of disease-causing amino acid changes. In the χ² test, results were considered significant at a P value < 0.05.
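The significance criterion above can be illustrated with a minimal sketch. It assumes that each residue type is tested separately with one degree of freedom and that the expected count is the total number of mutations multiplied by the residue's fraction of the amino acid composition; the text does not spell out either detail. The observed count and residue fraction in the example are invented for illustration, and the function names are hypothetical.

```python
import math

def chi_square_term(observed, expected):
    """Chi-square contribution of one residue type: (O - E)^2 / E."""
    return (observed - expected) ** 2 / expected

def p_value_1df(chi2):
    """Upper-tail P value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def residue_test(observed, total_mutations, residue_fraction, alpha=0.05):
    """Compare an observed mutation count with the composition-based expectation.

    Assumption: expected = total_mutations * residue_fraction.
    """
    expected = total_mutations * residue_fraction
    chi2 = chi_square_term(observed, expected)
    p = p_value_1df(chi2)
    return {
        "expected": expected,
        "chi2": chi2,
        "p": p,
        "significant": p < alpha,
        "direction": "over" if observed > expected else "under",
    }

# Illustrative numbers only: 300 observed mutations of a residue type that
# makes up 5% of the composition, out of 1939 mutations in total.
result = residue_test(observed=300, total_mutations=1939, residue_fraction=0.05)
print(result["direction"], result["significant"])
```

The one-degree-of-freedom P value is computed with the complementary error function so that the sketch needs only the standard library.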
The total chain length of the 44 investigated proteins is 12540 amino acids. The secondary structure elements are composed as follows. All helix structures together comprise 4567 amino acids, of which 4118 (~90%) are in α-helices, 449 (~10%) in 3₁₀-helices and just 2 in π-helices. There are 153 amino acids in β-bridges and 2530 in β-ladders, altogether 2683 (27% of the residues in secondary structures), and there are 1436 (15%) and 1190 (12%) residues in turns and bends, respectively. 18 proteins belong to the SCOP [37] class of all-α proteins, 12 to all-β proteins, 22 to α and β proteins (14 α/β and 8 α+β) and 2 to coiled coil proteins. The Btk PH domain belongs to all-β proteins and the kinase domain to α+β proteins. In VDR, the nuclear receptor ligand-binding domain belongs to all-α proteins and the glucocorticoid receptor-like (DNA-binding) domain to the SCOP class of small proteins. Altogether there were 1939 mutations in the secondary structures, 48% in helices, 28% in β-structures, and 24% in turns and bends. These numbers follow the distribution of the structure elements. The length of the secondary structures varied as follows: α-helices from 1 to 46 residues, with an average length of 11.3 residues, and 3₁₀-helices from 2 to 9 residues, average 3.4. In β-strands the length varied from 1 to 22, average 5.28. The length of turns varied from 1 to 8 residues, average 2.1, and that of bends from 1 to 6, average 1.6. Table 2 summarizes all the mutations in the studied proteins and Table 3 the mutations in the secondary structures only.
Figure 1. Amino acid distribution and relative mutability. Top row: amino acid distribution in the investigated proteins (left), overall relative mutability of mutated (middle) and mutant residues (right). The same information is shown in the second row for α-helices only, in the third row for β-strands, in the fourth row for turns and bends, and in the bottom row for structures outside secondary structural elements.
DSSP classifies secondary structures into three helix types, two extended β-strand types, turns and bends. The helices represent the classical α-helix, π-helix and 3₁₀-helix. The difference between the helix types originates from the number of residues per turn. In the normal right-handed α-helix there are 3.6 residues per turn, while the π-helix has 4.4 and the 3₁₀-helix only three. In helical structures the main chain residues form stabilizing hydrogen bonds with residues further along the sequence. All the side chains point out from the helical core. A large proportion of α-helices are amphipathic, i.e. one face of the helix is hydrophilic and the other hydrophobic [38]. The 3₁₀-helix is the fourth most common type of secondary structure in proteins after α-helices, β-strands, and reverse turns [39]. 3₁₀-helices commonly appear as N- or C-terminal extensions of α-helices. They are typically only three residues long compared with a mean of 10-12 residues for α-helices [40]. β-Strands are divided into isolated β-bridges and extended strands. β-Strands can form structure-stabilizing hydrogen bonds only with another strand in a parallel or antiparallel manner. Bends and turns are, according to DSSP terminology, relatively short structures that are located between the helical and strand elements. Well-organized β- and γ-turns reverse the direction of the polypeptide chain. Only glycines are allowed in certain turns at certain positions due to steric restrictions in very tight bends.

Overall amino acid mutational spectrum
First we investigated the statistical significance for mutated and mutant amino acids located in secondary structures compared to overall distribution. Then the patterns within each secondary structural element were revealed.
Arginine is clearly the most mutated residue type, as already indicated in previous studies [41][42][43]. The very high mutability of codons for arginine arises from CpG dinucleotides, which can spontaneously mutate by deamination either to TG or CA dinucleotides [44]. Arginine is coded by six codons, four of which have a CpG dinucleotide in the first and second codon position. In addition to the CpG dinucleotide we have also shown previously that the surrounding sequence context has an effect on mutability [42]. We calculated the relative mutability of all the amino acids as both mutant and mutated residues (Fig. 1). The residue that had the lowest ratio of observed vs. expected mutations (marked '1') was used as a reference in these calculations. This residue is lysine for mutated and alanine for mutant residues.
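The relative mutability calculation described above can be sketched as follows. This is a minimal illustration under the assumption that the expected count for a residue type is the total number of mutations multiplied by that residue's share of the amino acid composition; the residue with the lowest observed/expected ratio is then set to 1. The function name and the toy counts are hypothetical and are not taken from the dataset.

```python
def relative_mutability(observed, composition):
    """Observed/expected mutation ratio per residue, scaled to the lowest ratio.

    observed:    dict residue -> number of mutations seen for that residue
    composition: dict residue -> number of residues of that type in the proteins
    Assumption: expected = total mutations * fraction of the composition.
    """
    total_mutations = sum(observed.values())
    total_residues = sum(composition.values())
    ratios = {}
    for residue, count in observed.items():
        expected = total_mutations * composition[residue] / total_residues
        ratios[residue] = count / expected
    reference = min(ratios.values())
    if reference == 0:
        raise ValueError("reference residue has no observed mutations")
    # The residue with the lowest ratio becomes the reference, marked '1'.
    return {residue: ratio / reference for residue, ratio in ratios.items()}

# Toy counts for three residue types, for illustration only.
toy = relative_mutability(
    observed={"R": 60, "K": 10, "G": 30},
    composition={"R": 50, "K": 50, "G": 100},
)
for residue, value in sorted(toy.items()):
    print(residue, round(value, 2))
```

The same function can be applied to mutated or mutant residues by passing the corresponding observed counts.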
Among the original residues, arginine clearly has the highest relative mutability. The picture is completely different for the mutant residues, where the overall relative mutability is highest for cysteine and tryptophan. Table 2 shows that C, D, H, K, P and W are highly mutated residues. Cysteine is in second place after arginine among mutated residues. Glycine and tryptophan are also highly significantly overrepresented among mutated residues. Proline is the only amino acid forming a ring with the backbone and it has a very rigid structure, which bends the main chain of the protein in a characteristic way. Proline is a known breaker of secondary structures [30]. K, E, Q, F and T are significantly underrepresented among mutated residues.
Table 3 shows that the distribution of mutations in secondary structural elements is very close to that expected based on the mutation count in the whole proteins. Only S as mutated and M as mutant residue type are significantly underrepresented. Based on the amino acid distributions (Fig. 1) the composition of α-helices is close to the general distribution whereas β-structures and turns and bends have clearly different distributions. The amino acid compositions within secondary structures are very different. The share of α-helices is 36%, β-structures 21% and turns and bends 21% (the remaining 21% is in areas not assigned to any secondary structure type) of all the residues in the investigated structures. The secondary structures harbour 80% of all the disease-causing mutations. The amino acid distribution outside the regular secondary structural elements is very similar to that of whole proteins except for a high overrepresentation of P. The trends for mutated and mutant residues are similar to those in whole proteins. R has very high relative mutability for original residues (Fig. 1).
Figure caption: Changes to properties of amino acids caused by mutations.
The only remarkable difference between Tables 2 and 4 (data calculated based on the amino acid distribution in secondary structures) is methionine, which is highly overrepresented in Table 2 for all the mutations as a mutant residue, but has no statistical correlation in the secondary structures (Table 4). A, L and I appear much less frequently amongst the replacing residues than expected. Our results are in line with those published previously for general mutation distribution [11,12].

Mutational spectrum within α-helices, β-strands and turns and bends
The amino acid composition in α-helices is very similar to the overall amino acid composition (Fig. 1). The only main difference is in the ratio of glycines and prolines, which are clearly depleted in the helices, whereas glycines, as expected, are especially strongly overrepresented in turns and bends. The most prominent residues in β-strands are aliphatic isoleucine, leucine and valine followed by alanine, phenylalanine and threonine. Aspartic acid is surprisingly frequent in turns and bends. We further investigated the mutations within these secondary structures. The distribution of disease-causing mutations in helical structures in the investigated 44 structures is shown in Table 5. The number of mutations in helices is altogether 928 (48%), whereas β-strand structures of extended strands and isolated β-bridges contain 550 (28%) missense mutations (Table 6) and hydrogen bonded turns and bends contain 461 (24%) mutations (Table 7). The results for original and mutant residues are clearly and significantly different for the different structural elements. In helices, apart from arginine and cysteine, residues are not very highly mutated. If only the secondary structures are taken into account, glycine appears to be the most mutated residue after arginine. This result is similar to that of Steward et al., whose study revealed that glycine was the second most frequently mutating amino acid in the OMIM disease database [9].
Table footnote: χ² values in italics indicate underrepresentation and values in bold overrepresentation compared to a random distribution based on amino acid frequencies.
In helices, lysine is a strongly underrepresented residue compared with the calculated expected value. P, C, Q, W and D show statistically significant overrepresentation when compared to expected values, while S and L are underrepresented among mutant amino acids. Relative mutability in α-helices indicates that arginine is 7.29 times more mutated than expected whereas glycine is 4.88 times so.
In β-structures, R is statistically very significantly overrepresented and G and H have weaker enrichment. T is the statistically least mutated residue in β-strands. Among mutant residues, C is highly overrepresented and A underrepresented. Otherwise the results are less biased than in helices.
R is also statistically the most overrepresented residue in turns and bends although the χ² score is smaller than for the other secondary structures. G is the only other statistically highly significantly overrepresented amino acid. G is the most flexible residue because it does not have a side chain. It often appears in tight turns where no other residue can replace it. K is highly underrepresented among the mutated residues. Interestingly, R is the most enriched amino acid among mutant residues.
The distributions of amino acid frequencies and relative mutabilities in the different structural elements (Fig. 1) are clearly very different. First of all, the values are higher for mutated residues than for mutant residues, indicating that the original residue in many instances is very important and substitutions to any other residue are not possible without detrimental effects. Arginine and glycine are among the most mutated residues in all secondary structures whereas the other amino acids have large variations between the secondary structure types. The effect of arginine can also be seen in the data for mutant residues. Point mutations in arginine lead mostly to cysteine and glutamine, which have relatively high mutability values. The mutant residues in turns and bends have a more even distribution than the other two elements because practically any mutation to key residues in a tight hairpin turn leads either to the loss of hydrogen bonds or does not allow tight turn formation and leads to consequent structural alterations.
The data for mutations outside regular elements (Table 8) indicate that R, H and Y are significantly overrepresented and Q, K and E underrepresented among mutated residues while N is the only residue type that is significantly overrepresented.
Table footnote: the results of the χ² test are shown with significance levels: * P < 0.05; ** P < 0.01; *** P < 0.001.

Mutational spectrum in amino acid groups within structural elements
Amino acids can be grouped based on their physicochemical nature. The reason for performing an analysis with residue categories is that at many sites the specific amino acid is not very important but the properties it provides are. This can be seen e.g. in multiple sequence alignments of protein families. We used the following six groups: hydrophobic (V, I, L, F, M, W, Y, C), positively charged (R, K, H), negatively charged (D and E), conformational (G and P), polar (N, Q, S) and a sixth group of A and T [45]. These groups follow the amino acid substitution matrices used for sequence alignments and database searches. The results for amino acid frequencies in the secondary structure types are in Tables 9, 10, 11. Different groups are over- or underrepresented in different structure types. Negatively charged residues, due to arginine, are overrepresented in all three tables. Conformational residues are also overrepresented in all three secondary structure classes, but with varying χ² values. Positively charged amino acids are significantly underrepresented in helices and overrepresented in turns and bends.
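The six-group classification can be expressed as a simple lookup. The sketch below is illustrative only: the group memberships are those listed above, while the label `ala_thr` for the unnamed A/T group, the function name and the example substitutions are hypothetical.

```python
from collections import Counter

# Six residue groups as listed in the text. "ala_thr" is a hypothetical
# label for the group of A and T, which the text leaves unnamed.
GROUPS = {
    "hydrophobic": "VILFMWYC",
    "positive": "RKH",
    "negative": "DE",
    "conformational": "GP",
    "polar": "NQS",
    "ala_thr": "AT",
}
AA_TO_GROUP = {aa: group for group, members in GROUPS.items() for aa in members}

def group_counts(mutations):
    """Tally mutated (original) and mutant residues per group.

    mutations: iterable of (original, mutant) one-letter amino acid codes.
    Returns two Counters: one for original residues, one for mutant residues.
    """
    mutated = Counter()
    mutant = Counter()
    for original, replacement in mutations:
        mutated[AA_TO_GROUP[original]] += 1
        mutant[AA_TO_GROUP[replacement]] += 1
    return mutated, mutant

# Invented substitutions, for illustration only.
mutated, mutant = group_counts([("R", "C"), ("R", "Q"), ("G", "D")])
print(dict(mutated))
print(dict(mutant))
```

Group-level counts obtained this way could then be compared with composition-based expectations in the same manner as the counts for individual residues.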
Mutations to polar residues are significantly depleted in α-helices and especially so in turns and bends. The number of significant observations is lower in mutant residues. Negatively charged amino acids are overrepresented in both β-structures and in bends and turns, although the enrichment is relatively weak. Alanine and threonine are underrepresented in both helices and β-strands. These results likely reflect the importance of the original residue. Many disease-causing mutations affect the same positions, indicating that the site is structurally or functionally important and that any mutation there disrupts the protein activity.

Changes in amino acid volume and charge due to mutations
Next we examined the differences in volume and charge between the original residue and the replacing mutant residue (Fig. 2). The replacing residues within α-helices and β-strand structures are generally physically smaller than the original residues while in turns and bends the volume is remarkably increased. In 3₁₀-helices the replacing residue is on average 13.5 Å³ smaller than the original residue and in α-helices almost 4 Å³ smaller. Mutations to residues in β-bridges increase the volume, while mutations in β-strands on average reduce the volume of the amino acid.
The increase in volume in turns is 22.6 Å³ and in bends 15.8 Å³, thus on average the replacing residue is much larger than the original residue. In turns and bends, glycines form the biggest group among mutated native amino acids. Because glycine is the smallest amino acid, all mutations in G increase the volume of the protein. Differences at the amino acid level show that there are clear peaks in some residues, and that the changes in all secondary structure types are very similar (Fig. 3). Glycine (68 Å³) is replaced by residues that are on average 85 Å³ larger in turns and bends, 83 Å³ larger in α-helices and 78 Å³ larger in β-strands. The largest residues are W, Y, F and R. The largest amino acid, tryptophan (237 Å³), is replaced by residues whose volume is on average 100 Å³ smaller in β-strands and 94 Å³ and 93 Å³ smaller in α-helices and turns and bends, respectively. Being a bulky residue, tryptophan is very rare in turns and bends. In our dataset W accounted for 1% of all residues in turns and bends. There was just one mutated W in bends and three in turns.
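The average volume change reported above is the mean difference between the volume of the replacing residue and that of the original residue. A minimal sketch follows; only the two volumes quoted in the text (glycine and tryptophan) are filled in, and the values for the other residues would have to be taken from the volume scale used in the analysis. The function and variable names are hypothetical.

```python
# Residue volumes in cubic angstroms. Only the two values quoted in the text
# are included; the other 18 residues must be supplied from the volume scale
# in use before real data can be processed.
VOLUMES = {"G": 68.0, "W": 237.0}

def mean_volume_change(mutations, volumes=VOLUMES):
    """Average (mutant - original) volume over (original, mutant) pairs.

    A positive value means the replacing residues are larger on average.
    """
    changes = [volumes[mutant] - volumes[original] for original, mutant in mutations]
    if not changes:
        raise ValueError("no mutations given")
    return sum(changes) / len(changes)

# A glycine replaced by tryptophan gains 237 - 68 = 169 cubic angstroms.
print(mean_volume_change([("G", "W")]))
```

Applying the function separately to the mutations in each secondary structure type would give per-element averages of the kind discussed above.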
Average changes in isoelectric point (IEP) are similar for the main secondary structure types (Fig. 2C). In all the structures the IEP is increased on average compared to the normal structure. There are differences in the extent of the change; the largest changes are in turns and bends, on average close to 0.8 pI units. The reason for the dramatic average increase in pI values is the introduction of prolines and lysines that have relatively high pI values. Arginine has the highest pI value of all amino acids, but in turns and bends there are almost equally high enrichments of arginines in both mutated and mutant residues, so arginines do not affect the IEP crucially. Figure 2D also shows in detail the changes in different helices and β-structures.
The change is much larger in 3₁₀-helices than in α-helices. A similar situation is seen for bends, in which mutations lead to clearly larger changes. Differences at the amino acid level show that changes in charge in α-helices and β-strand structures are minor but in turns and bends there are clear peaks in some of the residues. C and E are replaced by residues with lower pI values and basic K and R by residues with higher pI values (Fig. 4).

The role of contact energies
We used RankViaContact [46] to calculate residue contact energies for all proteins in the dataset based on three-dimensional structures. The contact energies of the mutated residues range from -27.6 (very strong) to 7.8. The mutations are ranked based on their calculated contact energies. A slight majority (55%) of the mutations have strong or very strong contact energies. Most residues with strong contact energies are important for the stability of the protein structure. Many of the disease-related mutations are thus located in crucial structural sites in which alterations are deleterious. In Figure 5A the count of contact energies of the mutated residues has been calculated in all proteins and separately for secondary structures. In β-strands the distribution of contact energies is even, but in other structures the distribution is biased towards weak positive contact energies. Figure 5B shows the contact energy distribution for mutations in proteins and secondary structure types as well as outside the secondary structural elements. The curves have quite similar overall shapes although the location of the peak for the maximal occurrence varies. Of note is also that the mutation positions in turns and bends and outside secondary structures do not appear in sites of very strong contacts. Mutation sites in β-structures have a very even distribution throughout the contact energy range.
We organized the residues into six groups based on their physicochemical properties and calculated the percentages of strong and weak contact energies in the groups (Fig. 6). Among hydrophobic residues more than 90% of the original amino acids have strong contact energies. Polar, conformational, positively and negatively charged mutated residues have for the most part weak contact energies (70-85%). A and T residues have mainly strong contact energies. These results indicate the importance of the residue type for interactions, whether hydrophobic, van der Waals or electronic interactions. Charged residues often form salt bridges, while hydrophobic and aliphatic amino acids are involved in weaker interactions. A large proportion of mutated residues form interactions which are essential for the protein and its structure. Structural alterations are the most common consequence of disease-causing mutations [18].

Conclusion
Germline substitutions leading to the replacement of a native residue can result in either benign effects (e.g. polymorphisms) or in genetic disease. It has been shown that most positions in proteins can be altered without serious effects on protein structure or function [47]. On the other hand, the majority of disease-causing mutations have structural effects [14]. Therefore, mutations that lead to a disease phenotype indicate the importance of the specific location. Knowledge of the molecular basis of mutations is important and can be used in several ways.
Mutation spectra are clearly different for the different structural elements. The most prominent feature for all the elements is the strong overrepresentation of arginine as mutated residue. Large changes in the properties of mutant amino acids, in volume or charge, are disease-related, but there are large differences between the structural elements. Contact energy distributions of mutated residues are surprisingly similar except for β-structures. About half of the mutation sites are involved in strong or very strong amino acid interactions. In conclusion, there are many strong trends in mutant and mutated residues.
The trends are statistically significantly different for the different secondary structures.
Figure caption: Changes in mutations to residue charges.
Previously, mutation statistics have been studied at a general level [8,12,48] as well as a structural level [9,12], but detailed analysis of the spectrum and effects of mutations within secondary structural elements has been missing. In the study of Ferrer-Costa et al. [12] the dataset consisted of 1169 disease-associated single amino acid polymorphisms (daSAPs) distributed over 73 proteins with structure information from the PDB. The study combines information from all secondary structure elements without element-specific data. In the study of Steward and coworkers [9] 63 proteins contained 1292 disease-associated sites. They did not analyse secondary structures separately. Our study involved 1939 missense mutations in secondary structures and 2411 mutations altogether. We localized the majority of missense mutations to α-helices.
The biggest group of daSAPs is found in coils, and helices contain 36.7% of all daSAPs. Ferrer-Costa et al. also included volume comparisons in their study. In contrast to our work, the size changes for daSAPs were calculated for the whole dataset. We calculated the changes for each secondary structure type separately, and also at the amino acid level. Arginine is the most commonly mutated residue. The result is identical to that of Steward et al. [9]. Wang and Moult [18] examined 262 missense mutations in 23 proteins and concluded that 80% of mutations destabilize the protein structure relative to the folded state based on changes in hydrophobic burial, backbone strain, overpacking, and electrostatic interactions. In several other studies, structural properties have been combined with a sequence profile for the mutated position [21,23,49], which reflects the occurrence of other amino acid types at corresponding sites in homologous proteins.
Vitkup et al. [11] investigated a total of 4236 mutations from 436 genes and concluded that mutations at arginine and glycine residues are together responsible for about 30% of genetic diseases. This result is similar to ours. In our dataset, 25% of all missense mutations occur in R and G, and 23% are present in the examined secondary structures. Vitkup et al. also found that random mutations at tryptophan and cysteine have the highest probability of causing disease, which is in line with our results. The overall relative mutability to mutant residues is 3.46 for C and 3.31 for W compared to alanine, which has the lowest ratio of observed vs. expected mutations and is set to 1.
To correlate the mutated and mutant types to protein structures it is important to know at which secondary structural element the alteration appears. Protein function and interactions require both stability and specificity. Proteins fold according to the minimum free energy. In contrast, they organize themselves to recognize a transition state or a ligand [50]. Amino acids and mutations have very distinct differences in the frequencies in the secondary structural elements. Multiple sequence alignments can be used to investigate allowed substitutions in protein families. Our analysis revealed mutation types which are most likely deleterious. This information could also be used for the development and optimization of amino acid comparison tables for individual secondary structural elements. Our data could also add to the reliability of the predictions of mutation effects.
Figure 5. Distribution of contact energies for mutated amino acids. The energies were calculated with the RankViaContact program. A) Data for all mutations, and B) distribution in secondary structural elements α-helices (red), β-strands (blue), turns and bends (green), outside secondary structures (yellow), and whole proteins (black).

Methods
We investigated proteins for which numerous disease-causing missense mutations were available along with an experimentally defined three-dimensional structure. Mutation data was collected from BTKbase [51][52][53] for Btk mutations, CD40Lbase [54][55][56] for CD40L mutations and from the Human Gene Mutation Database (HGMD) [6] for the remainder. The publicly available mutations are listed in supplementary table 4 [see Additional file 4]; the full list is available from the authors on request. The DSSP program [57] was used to identify secondary structures for three helix types, two extended β-strand types, turns and bends, based on the stereochemistry of the protein structure. The DSSP program was used to systematically assign secondary structures for each residue in the three-dimensional structures obtained from the Protein Data Bank (PDB) [58]. A Perl script was written to connect mutation information to protein structure data (dssp files).
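The mapping from DSSP assignments to the examined structure classes can be illustrated with a short sketch. It is written in Python rather than Perl and is not the script used in this work. It assumes that the three helix types correspond to the DSSP codes H, G and I, the two extended β-strand types to E and B, and turns and bends to T and S; the function names and the toy data are hypothetical.

```python
from collections import Counter

# DSSP one-letter codes grouped into the classes named in the text.
# The grouping is an assumption made for this sketch.
DSSP_CLASS = {
    "H": "helix", "G": "helix", "I": "helix",  # alpha-, 3-10- and pi-helix
    "E": "strand", "B": "strand",              # extended strand, isolated beta-bridge
    "T": "turn_bend", "S": "turn_bend",        # hydrogen-bonded turn, bend
}

def classify(dssp_code):
    """Return the structure class for a DSSP code; anything else is 'other'."""
    return DSSP_CLASS.get(dssp_code, "other")

def count_mutations_by_class(residue_codes, mutated_positions):
    """Tally mutations per structure class.

    residue_codes maps residue number -> DSSP code for one chain;
    mutated_positions lists the residue number of every mutation.
    Positions absent from residue_codes are counted as 'other'.
    """
    counts = Counter()
    for pos in mutated_positions:
        counts[classify(residue_codes.get(pos, " "))] += 1
    return counts

# Hypothetical toy data: six residues and six mutations (two at residue 10).
toy_codes = {10: "H", 11: "H", 12: "T", 13: " ", 14: "E", 15: "B"}
toy_mutations = [10, 12, 13, 14, 15, 10]
toy_counts = count_mutations_by_class(toy_codes, toy_mutations)
```

Counting per mutation rather than per position means that several different substitutions at the same residue each contribute to the tally of that residue's class.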
If more than one structure existed for a protein or a protein domain, the one with the highest resolution and the longest chain length was chosen. All data was stored in a MySQL database. Altogether we investigated 44 proteins with a total of 12540 residues, of which 9878 were in the examined secondary structures. The proteins contained 2411 different disease-causing mutations, 1939 present in secondary structures.
The RankViaContact service was used to calculate residue-residue contact energies based on a coarse-grained model [46]. The energy parameters used for residue-residue contacts [59] were derived considering the secondary structural environments. The contact energies were estimated for all the missense mutations located in the secondary structures.
Mutation statistics were analysed by comparing the frequencies of the obtained mutations with the expected values. Expected values for mutated residues within α-helices, β-strands and turns and bends were calculated using the distribution of all amino acids in the respective secondary structure. In the case of the mutant amino acids within different secondary structures, the expected values were calculated from codon diversity by taking into account all possible amino acid substitutions. In order to reveal how the mutation distribution in secondary structures compares to the overall distribution of mutations in the dataset, the expectation values for mutated and mutant residues were calculated based on all the mutations in secondary structures.
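One way to derive expected values for mutant residues from codon diversity is sketched below in Python. The sketch enumerates every single-nucleotide substitution in the 61 sense codons of the standard genetic code and counts how often each amino acid arises as a missense product. It assumes equal weight for all codons and all substitutions, which may differ from the weighting actually applied in this work; the function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def missense_target_counts():
    """Count single-nucleotide routes leading to each mutant amino acid.

    Synonymous changes and changes to or from stop codons are ignored,
    so only missense substitutions are counted.
    """
    counts = Counter()
    for codon, original in CODON_TABLE.items():
        if original == "*":
            continue
        for i, base in enumerate(codon):
            for new_base in BASES:
                if new_base == base:
                    continue
                mutant = CODON_TABLE[codon[:i] + new_base + codon[i + 1:]]
                if mutant != "*" and mutant != original:
                    counts[mutant] += 1
    return counts

def expected_mutant_counts(n_mutations):
    """Distribute n_mutations over mutant residues in proportion to the
    number of missense routes (uniform-weight assumption)."""
    counts = missense_target_counts()
    total = sum(counts.values())
    return {aa: n_mutations * c / total for aa, c in counts.items()}

route_counts = missense_target_counts()
expected = expected_mutant_counts(1000)
```

Under these assumptions, amino acids encoded by a single codon, such as tryptophan and methionine, receive few routes and therefore low expected values, so even a moderate number of observed mutations to them yields a high observed-to-expected ratio.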
Figure 6. Mutant residues with strong and weak contact energies. The percentage of mutant residues with strong (black) and weak (white) contact energies in physicochemical amino acid groups.
The χ² test was used to determine the significance of the results. Chi-square values were calculated using the following formula:

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

where $f_o$ is the observed frequency and $f_e$ is the expected frequency for an amino acid. P-values and 95% confidence intervals were estimated in one-tailed fashion. Relative mutability was calculated for each amino acid type relative to N', the least mutated residue type, which was identified by calculating the ratio between observed and expected values. N represents the number of mutated original or mutant residues for an amino acid type. The relative mutability was calculated for each residue type in all the investigated secondary structural elements.
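The two statistics can be sketched in Python as follows. The χ² function follows the standard definition. The relative mutability function is an assumed reading, since the formula is not reproduced in the text: each observed-to-expected ratio is divided by the lowest such ratio, so that the least mutated residue type receives the value 1, in line with the statement in the Discussion that alanine is considered as 1. The function names and the toy numbers are hypothetical.

```python
def chi_square(observed, expected):
    """Sum of (f_o - f_e)^2 / f_e over all amino acid types."""
    return sum(
        (observed[aa] - expected[aa]) ** 2 / expected[aa] for aa in observed
    )

def relative_mutability(observed, expected):
    """Observed/expected ratios scaled so that the lowest ratio equals 1.

    Assumed interpretation of the relative mutability described in the text.
    Requires every residue type to have a non-zero observed count,
    otherwise the baseline ratio would be zero.
    """
    ratios = {aa: observed[aa] / expected[aa] for aa in observed}
    baseline = min(ratios.values())
    return {aa: ratio / baseline for aa, ratio in ratios.items()}

# Hypothetical toy counts for three residue types.
toy_observed = {"R": 30, "G": 20, "A": 10}
toy_expected = {"R": 20, "G": 20, "A": 20}

toy_chi2 = chi_square(toy_observed, toy_expected)
toy_mutability = relative_mutability(toy_observed, toy_expected)
```

In the toy data the residue with the lowest observed-to-expected ratio serves as the baseline, so the other values express how many times more often a residue type is mutated than the least mutated type.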
In order to calculate changes in the isoelectric point (IEP), we used the EMBOSS iep program [60] to predict IEPs for native and mutant proteins. The average change in IEP in each secondary structure type was calculated. The changes in residue volumes in different secondary structure types were obtained by using residue volumes [61]. The results were weighted by the number of individual mutations and averaged by the number of mutant residues in the respective secondary structures.