In our analysis, we have estimated the extent of length variation in protein superfamilies and employed it as a measure of structural variation between homologous proteins. Numerous measures have been used to quantify protein structural similarity, including RMSD, SSAP, contact maps, and DALI and VAST scores [23–26]. We are interested in the tolerance of folds to large variations in length and have, therefore, employed standard deviation and mean length variation to determine this. Proteins of similar lengths may still differ in the orientations of individual secondary structures and adopt different folds. To that end, a simple scoring scheme that parses pre-derived structural alignments of known related proteins from the PASS2 database and quantifies the extent of length variation in all protein superfamilies is used to empirically estimate trends emerging in the dataset. We have also performed the analysis on multi-membered domain superfamilies (> 3 members) for an empirical assessment of the data involving 353 domain superfamilies. The trends obtained in the dataset are also observed when only the most length-deviant or highly populated domain superfamilies are considered.
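As an illustration of the kind of length statistics described above, the following is a minimal sketch, not the scoring scheme actually used. It assumes a superfamily alignment is available as a mapping from member name to gapped sequence; the function names are hypothetical, and the reading of "mean length variation" as a mean absolute deviation from the mean length is an assumption.

```python
from statistics import mean, pstdev


def aligned_lengths(alignment):
    """Count non-gap residues per member.

    `alignment` is assumed to map member name -> gapped sequence,
    with '-' or '.' marking gaps (a hypothetical input format).
    """
    return {name: sum(1 for ch in seq if ch not in "-.")
            for name, seq in alignment.items()}


def length_variation(domain_lengths):
    """Return (mean length, standard deviation, mean absolute deviation).

    The mean absolute deviation is one possible reading of
    'mean length variation'; the source does not define it.
    """
    if len(domain_lengths) < 2:
        raise ValueError("need at least two members to measure variation")
    avg = mean(domain_lengths)
    sd = pstdev(domain_lengths)  # population standard deviation
    mad = mean(abs(n - avg) for n in domain_lengths)
    return avg, sd, mad


# Toy superfamily with four members (made-up sequences).
toy = {
    "d1": "MKT--LLV",
    "d2": "MKTAALLV",
    "d3": "MK---LLV",
    "d4": "MKTA-LLV",
}
lengths = list(aligned_lengths(toy).values())
avg, sd, mad = length_variation(lengths)
```

Using the population standard deviation treats the listed members as the whole superfamily; a sample standard deviation (`statistics.stdev`) would be the alternative if the members are viewed as a sample of a larger family.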
We have presented a method, CUSP, which processes protein structure-driven alignments to identify conserved structural units common to all related proteins. In doing so, regions that allow variations to accumulate and confer uniqueness to each protein, annotated as USB, are also identified for every superfamily. The scoring schemes were arrived at after examining alignments derived independently from other approaches such as CE and CDD. In 8 of the 10 superfamilies examined, CUSP detects > 60% of the conserved residues reported by other alignment methods. For the two superfamilies which show < 45% coverage, the large difference in the number of structural entries examined may be responsible for the difference in performance. While the alignments from CDD included very close sequence homologues, the structural representatives considered in the CE alignment included domain members of similar lengths and also more sequence-diverse members. A strict cut-off of 75% is employed to characterize the structural type at each alignment position as H, E or C, and this, in fact, increases the stringency of the scores. These assignments and scores are, therefore, representative and predictive of SSB assignments in all members of the superfamily, including new sequence relatives. Cut-off schemes similar to ours have been employed earlier in JOY representations of structural alignments and in estimating equivalences of secondary structures (SSE) for deriving matrices.
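The 75% cut-off on structural type can be pictured with the short sketch below. It is not the CUSP implementation: the input format (one secondary-structure string per aligned member), the function names, the treatment of gaps as counting against the majority, and the use of "at least 75%" rather than "more than 75%" are all assumptions made for illustration.

```python
from collections import Counter

CUTOFF = 0.75  # fraction of members that must share a structural type


def column_consensus(column, cutoff=CUTOFF):
    """Return 'H', 'E' or 'C' if that type covers >= cutoff of the column.

    Gaps ('-') count in the denominator (an assumption), so a column
    dominated by gaps or split between types is reported as '-'.
    """
    state, count = Counter(column).most_common(1)[0]
    if state in ("H", "E", "C") and count / len(column) >= cutoff:
        return state
    return "-"


def consensus_string(ss_alignment, cutoff=CUTOFF):
    """Apply the cut-off to every column of equal-length aligned strings."""
    return "".join(column_consensus(col, cutoff) for col in zip(*ss_alignment))


# Four hypothetical members, five alignment positions.
members = ["HHHEC",
           "HHHEC",
           "HHCE-",
           "HHCCC"]
consensus = consensus_string(members)  # third column is split 2:2, so no call
```

Raising the cut-off leaves more positions unassigned, which is the sense in which a strict threshold makes the resulting block assignments more stringent.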
We have also attempted a study of the domain contexts and associations of length-deviant domains and their functional consequences (Table S2, and manuscript under preparation). Reeves et al. have examined equivalent secondary structures between CATH superfamilies and suggest that such additional structural elements contribute effectively to functional variety in the highly populated superfamilies.
Since the CUSP algorithm works with a scoring scheme to detect consensus trends in a majority of the superfamily members, the extent of conservation of each structural type in each block is annotated, and it is therefore possible to extract features that correlate with the extent of conservation of each structural type. An analysis of the nature of such USBs shows that additional lengths can occur either as extensions or as insertions in the middle of a protein structure. A class-specific trend for the type of structure adopted in indel regions has also emerged in the current analysis, and each class prefers a specific type of structure (Figure 3, Figure S1 (b-d)). Figure 4 shows examples of different superfamilies that exhibit this class-specific nature in accommodating length variations.
We find that in all superfamilies examined, the structurally unconserved regions amongst related proteins do not all retain a uniform pattern in solvent accessibility. This coincides with the expectation that it is in such regions that variation in lengths between proteins is introduced. To preserve the core scaffold, which may be the driving force in limiting the number of folds, indel regions are more prone to structural changes, and this may result in greater solvent exposure in some proteins or alter protein surfaces to modify interaction interfaces. β-strands show a universal preference for solvent avoidance, and this reflects the preference of such strands to avoid isolation from the protein core and to integrate into the structure as well-ordered sheets (Table S2). In proteins of the α-β class, coils show a clear preference for solvent exposure, more so in α + β class superfamilies where they are vital in segregating α and β units. Inferences on solvent exposure, in the present analysis, are limited to individual domains of the proteins and do not consider multi-domain contexts and oligomerisation states of the proteins.
Based on the extent of length variation observed in different superfamilies, we have clustered all the superfamilies into length-rigid and length-deviant groups. Interestingly, length-rigid proteins are not as well-populated (as reflected in the number of members that are functionally diverse and in the number of families) as length-deviant proteins. While on the one hand, this does indicate that with the availability of more structures, trends in length-deviations could be affected in the identified rigid superfamilies, one may argue that such superfamilies are not preferred due to their strict length limitations and limited functional promiscuity (as reflected in the number of families). Length-deviant proteins, on the other hand, are found to include superfolds such as the P-loop NTP hydrolases, Ferredoxin folds etc., that have already been shown to be well represented in many genomes.
In many length-deviant protein superfamilies, despite large differences in length (over two-fold in some cases), the core is often well preserved. The large additional lengths often do not involve the active site; in many cases they affect the oligomerization states and interacting surfaces of the protein (Ferritin-like domain superfamily), introduce substrate specificity (SH3 domains) and in some cases play an auto-regulatory role (Table S2). Since our analysis is derived from the PASS2 database of domain superfamilies, which in turn is guided by the domain definitions of SCOP, it is highly likely that severe length deviation, exhibited as additional domains, has escaped our attention.
These interesting trends that we have obtained on the nature and type of indels in protein superfamilies from different classes could impact the area of comparative modeling in indel regions of newer superfamily members. We have obtained some distinct trends on indels that are class-specific, with information on typical lengths. Such information, we expect, will be useful in the choice of specific structural types for newer relatives of protein superfamilies. Each superfamily shows a distinct trend in length variability and such information can be fed, by the assignment of variable gap penalties, into sequence alignment approaches to improve homology detection amongst members that vary considerably in length. We trust that such analyses would provide guiding principles during sequence searches, alignment and homology modeling of distant relationships.
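One way the suggested variable gap penalties could be derived from block annotations is sketched below. This is purely illustrative: the per-position conserved/unconserved flags, the function name, and the base penalty and scaling factors are hypothetical values, not parameters reported in this work.

```python
def gap_open_penalties(is_conserved, base=10.0,
                       conserved_scale=2.0, variable_scale=0.5):
    """Map per-position conservation flags to gap-open penalties.

    `is_conserved` is a sequence of booleans, True where the position
    falls in a structurally conserved block and False where it falls
    in a region that tolerates indels. All numeric defaults are
    hypothetical and would need tuning per superfamily.
    """
    return [base * (conserved_scale if flag else variable_scale)
            for flag in is_conserved]


# Conserved core flanking a two-residue variable region.
flags = [True, True, False, False, True]
penalties = gap_open_penalties(flags)  # gaps are cheaper where indels are tolerated
```

Such a penalty vector could be passed to any aligner that accepts position-specific gap costs, so that gaps open preferentially in regions already known to vary in length.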