CUSP: an algorithm to distinguish structurally conserved and unconserved regions in protein domain alignments and its application in the study of large length variations

Background Distantly related proteins adopt and retain similar structural scaffolds despite length variations that could be as much as two-fold in some protein superfamilies. In this paper, we describe an analysis of indel regions that accommodate length variations amongst related proteins. We have developed an algorithm, CUSP, to examine multi-membered PASS2 superfamily alignments and identify indel regions in an automated manner. Further, we have used the method to characterize the length, structural type and biochemical features of indels in related protein domains.

Results CUSP examines protein domain structural alignments to distinguish regions of conserved structure common to related proteins from structurally unconserved regions that vary in length and type of structure. On a non-redundant dataset of 353 domain superfamily alignments from PASS2, we find that 'length-deviant' protein superfamilies show > 30% length variation from their average domain length. 60% of additional lengths that occur in indels are short-length structures (< 5 residues) while 6% of indels are > 15 residues in length. Structural types in indels also show class-specific trends.

Conclusion The extent of length variation varies across different superfamilies and indels show class-specific trends for preferred lengths and structural types. Such indels of different lengths, even within a single protein domain superfamily, could have structural and functional consequences that drive their selection, underlining their importance in similarity detection and computational modelling. The availability of systematic algorithms, like CUSP, should enable decision making in a domain superfamily-specific manner.

structures in an alignment to enable a quick visual assessment of equivalent structures. The sequence and structural alignments, displayed in separate panels, allow users to define color schemes for core secondary structural elements. The application calculates the number of protein secondary structures in each sequence and presents the results in a tabular format to facilitate comparisons of the distribution of secondary structures across and within multiple families.

S5. Conservation of Solvent accessibility in conserved structural units
As described for the calculation of block scores in the CUSP algorithm (see Methods), PSA scores were assigned to structural blocks to assess how well solvent accessibility is conserved within them. The averaged PSA score of each block was assigned to one of three bins, 0-30%, 30-50% and >50%, indicating buried, partially exposed and exposed regions, respectively. The distributions of PSA scores in the 'high' conserved blocks of the three structural types [α, β and coil] were plotted to determine whether solvent accessibility is conserved in a class-specific manner.
The PSA scores were calculated treating each domain as a monomer; multimeric assemblies were not included in the calculations.
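The binning step can be illustrated with a minimal Python sketch. It is not the CUSP implementation: the function names, the input layout (a mapping from a block label to per-residue PSA percentages) and the treatment of the boundary values (exactly 30% counted as buried, exactly 50% as partially exposed) are assumptions made for illustration only.

```python
from statistics import mean

# Bin labels follow the text: 0-30% buried, 30-50% partially exposed, >50% exposed.
BIN_LABELS = ("buried", "partially exposed", "exposed")


def classify_block(psa_values):
    """Return the accessibility bin for one structural block.

    psa_values: per-residue PSA scores (percentages) of the block.
    Assumption: boundary values fall into the lower bin (30 -> buried,
    50 -> partially exposed); the source does not specify this.
    """
    avg = mean(psa_values)  # averaged PSA score of the block
    if avg <= 30:
        return "buried"
    if avg <= 50:
        return "partially exposed"
    return "exposed"


def bin_distribution(blocks):
    """Count how many blocks fall into each accessibility bin.

    blocks: hypothetical mapping of block label -> list of PSA scores.
    """
    counts = {label: 0 for label in BIN_LABELS}
    for psa_values in blocks.values():
        counts[classify_block(psa_values)] += 1
    return counts


if __name__ == "__main__":
    # Made-up example values, not data from the paper.
    example_blocks = {
        "helix_1": [10, 20, 30],
        "strand_1": [40, 45],
        "coil_1": [60, 80],
    }
    print(bin_distribution(example_blocks))
```

Running the same count separately for the α, β and coil blocks would give the per-class distributions that the text describes plotting.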

Cytochrome C
The cytochrome-C superfamily includes many proteins that are vital components of electron transfer mechanisms in both prokaryotes and eukaryotes. Diverse sequences (~24% sequence identity) specify a compact cytochrome-C structure shared by all members. The Cytochrome C fold typically consists of at least four α-helices that envelop a heme group, a short 3₁₀-helix and several turns. Related members show up to two-fold variation in length and are represented by 'dwarf' domains such as cytochrome C-551 and cytochrome C-553 [~70-80 residues] as well as 'giant' domains such as methylamine dehydrogenase and cytochrome C-552 [~130-150 residues]. The CUSP algorithm, when applied to alignments involving members of diverse lengths, arrives at a structural consensus that detects the structural integrity of the heme-binding pocket involving at least four α-helices and a predominantly hydrophobic pocket that is well conserved amongst all members [S1]. The CXXCH motif, which lies on spatial motifs originating from different structural elements, is also detected. Alignments of the family involving different members and independently derived through CE [S2] show that the CUSP algorithm detects ~69% of the structurally equivalent residues detected by CE (Table S3). We have examined the functional roles of the additional structural motifs that appear in the giant members of the superfamily and find that they appear to characterize each protein and confer thermal stability to certain members. Most differences in length are due to variations in the lengths of surface loops connecting the α-helices.

Figure caption (partial): [...] Table 2). c) Distribution of structural types in indel regions of the 64 length-deviant domain superfamilies (1-64 on the X axis correspond to the 64 length-deviant domain superfamilies listed in Table 2). d) Structural types in indel regions of the highly populated domain superfamilies listed in Table 1.

Additional tables:
Table S1: List of 'length-rigid' superfamilies (>4 members) across all the structural classes.

[...] between longest and shortest members of 'length-rigid' superfamilies.

Interaction interfaces that dictate function: In the giant member, the capsid protein has a protruding (P) domain connected by a flexible hinge to a shell (S) domain that has a classical eight-stranded beta-sandwich motif. The structure of the P domain is unlike that of any other viral protein, with a subdomain exhibiting a fold similar to that of the second domain in the eukaryotic translation elongation factor-Tu. This subdomain, located at the exterior of the capsid, has the [...]

[...] (57) supersecondary structure in the catalytic domain common to the alpha-amylase family enzymes, though the barrel is incomplete, with a deletion of an alpha-helix between the fifth and sixth beta-strands.