A comparative analysis of the foamy and ortho virus capsid structures reveals an ancient domain duplication
© The Author(s) 2017
Received: 11 November 2016
Accepted: 10 March 2017
Published: 4 April 2017
The Spumaretrovirinae (foamy viruses) and the Orthoretrovirinae (e.g. HIV) share many similarities both in genome structure and the sequences of the core viral encoded proteins, such as the aspartyl protease and reverse transcriptase. Similarity in the gag region of the genome is less obvious at the sequence level but has been illuminated by the recent solution of the foamy virus capsid (CA) structure. This revealed a clear structural similarity to the orthoretrovirus capsids but with marked differences that left uncertainty in the relationship between the two domains that comprise the structure.
We have applied protein structure comparison methods in order to try and resolve this ambiguous relationship. These included both the DALI method and the SAP method, with rigorous statistical tests applied to the results of both methods. For this, we employed collections of artificial fold ’decoys’ (generated from the pair of native structures being compared) to provide a customised background distribution for each comparison, thus allowing significance levels to be estimated.
We have shown that the relationship of the two domains conforms to a simple linear correspondence rather than a domain transposition. These similarities suggest that the origin of both viral capsids was a common ancestor with a double domain structure. In addition, we show that there is also a significant structural similarity between the amino and carboxy domains in both the foamy and ortho viruses.
These results indicate that, as well as the duplication of the double domain capsid, there may have been an even more ancient gene-duplication that preceded the double domain structure. In addition, our structure comparison methodology demonstrates a general approach to problems where the components have a high intrinsic level of similarity.
Taxonomically, the Orthoretrovirinae (orthoretroviruses) and Spumaretrovirinae (spumaviruses; see footnote 1) make up the two subfamilies of Retroviridae. They share many similarities, including overall genome structures with gag, pol and env genes encoding proteins for replication and life cycles involving reverse transcription and integration into the chromosomes of infected cells. However, there are also a number of differences distinguishing these viral subfamilies, including finer details of genome organisation, the absence of a Gag-Pol fusion protein in spumaviruses and the timing of reverse transcription.
Gag is the major structural protein of both Ortho and Foamy viruses and is responsible for many of the differences and similarities between the viral subfamilies. Ortho and Foamy viral Gags are required for particle assembly, budding from the cell, reverse transcription and delivery of the viral nucleic acid into the newly infected cell. However, there are a number of striking differences including how the Gag precursor is targeted to the cell membrane, the absence of a Major Homology Region and Cys-His box in Foamy viruses and very different patterns of processing during viral maturation. In all Ortho viruses, Gag is proteolytically cleaved to form distinct, well-studied proteins, matrix (MA), capsid (CA) and nucleocapsid (NC), found in mature virions, whilst in spumaviruses Gag processing to remove a C-terminal peptide occurs only in a fraction of the Gag molecules.
Structural information regarding foamy virus Gag has been limited to the crystal structure of the N-terminal Env-binding region of Prototypic Foamy virus (PFV) Gag (PFV-Gag-NtD), which, although it maintains some of the functions of orthoretroviral MA, shares no structural similarity with it. However, more recently the solution NMR structure of the PFV Gag central CA domains has shed new light on the relationship between ortho and spumaviruses. It reveals that the CA structures of both viral subfamilies share a common protein fold, implying that their Gag proteins may be evolutionarily related.
However, an intriguing aspect of this relationship was an ambiguity in the degree of relatedness between the CA domains of the Gag proteins, with the Spumaretroviral CA domains, NtDCEN and CtDCEN, appearing almost equally similar to either the amino- (CA-NtD) or carboxy-terminal (CA-CtD) domains of the orthoretroviruses. With small domains that share a high degree of background similarity, particularly those composed entirely of α-helices, it is very difficult to evaluate the significance of their structural relationships, as chance combinations of a few helices can give rise to apparently convincing overlaps with a low RMSD.
In this paper, we now investigate and clarify the nature of the relationship between these capsid domains and discuss its evolutionary implications. Our work provides a demonstration of a general approach to the resolution of difficult comparison problems in which the proteins share a high intrinsic level of similarity.
Although this initial superposition (Fig. 1) did not appear encouraging, the foamy virus structure was scanned across the Protein Data Bank (PDB), using the DALI program to search for any similarities.
Full chain scan
The result of the DALI search indicated that the Foamy virus structure shares some similarity with the capsid structure of the ortho-viruses. However, the matches consist only of a small number of helices and appear barely more convincing than other matches to proteins that seem very unlikely to have any meaningful connection to a viral capsid. The preponderance of capsid matches throughout the list of hits might seem to add some support to the relationship but may simply be a reflection of the number of capsid structures in the structure databank.
Adding confusion to the ortho/foamy relationship is the additional observation that the matches to the ortho-virus structures are distributed across both the amino (N) and carboxy (C) terminal domains. For example, taking the top 10 matches, the N-terminal domain of the Foamy structure aligns with 6 C-terminal domains and 4 N-terminal domains of the ortho viruses, and the best match for the corresponding Foamy C-terminal domain is to an ortho N-terminal domain.
Although domain transposition is not impossible in viral genomes, it is sufficiently unexpected to warrant deeper investigation, especially as it is hard to imagine how an ancestral capsid protein could tolerate such a large rearrangement and still pack to form a competent shell. We therefore undertook a more thorough evaluation using alternative methods to assess the statistical significance of these structural similarities.
Structural alignment significance
For each comparison, the DALI program calculates an empirical Z-score, combining an estimation of significance with protein length normalisation. The program reports all matches over Z=2. However, when the proteins are small, and especially when the structures being compared are both predominantly alpha-helical, matches over this cutoff include many functionally unrelated hits in which the similarity has arisen through the fortuitous alignment of a few helices.
The equivalent scans with the reversed domain structures, using both the foamy and ortho (HIV) structures (neither of which should have any particular relationship to the capsid or any other natural protein) also found hits with high Z-scores (black and blue points in Fig. 6, respectively). When compared with the native domains (Fig. 6), these decoys had a profile that tracked mostly above the N-terminal native domain but below the C-terminal domain. However, with the latter domain, this was only distinct in the hits to the full PDB whereas with the PDB-90, the native domain was only clearly better over the top 10 matches, half of which were to non-capsid structures.
The results with the simple reversed decoy using DALI suggested that the match of the foamy virus domains to the ortho virus capsid N-terminal domain may be due to chance and that the match to the C-terminal domain looks meaningful if based on the hits to the full PDB but may be only marginal based on the PDB-90 hits.
Customised decoy comparisons
Ortho and foamy domain comparison Z-score statistics
Statistical analysis of the decoy comparisons
The quality of the comparisons in Fig. 8c can be quantified as a combination of their RMSD (R) and the number of matched (superposed) positions (N). However, as explained in the “Methods” section, for statistical analysis, it is easier to combine this pair of numbers as a single number, called the a-value (Eq. 1), which is the scaling factor that causes a theoretical curve to pass through the point (R,N).
In this way, the significance of all combinations of the native ortho and foamy domain superpositions were calculated, using the background distribution of ‘customised’ decoy comparisons based on each individual native pair. The resulting Z-scores (σ units) are collected in Table 1. The degree of similarity between the domains ranged from less than 1 σ to over 5 σ, with the latter (highly significant) result being obtained for both a swapped (NC) and forward (CC) combination. However, of the top 12 scores, only three now came from swapped pairings.
The majority of values in Fig. 10a lie below the 0.05 probability level for the larger sample sizes, with those for the top-half bias statistic (blue line) being more significant than the moment-based statistic (red line). While confirming the visual trend towards a bias of higher scoring like-type domain similarities, the analysis summarised in Fig. 10a is complicated by having unequal numbers of amino and carboxy domain comparisons and also by including some closely related structures. To produce a more balanced data-set, one of each pair of the two most similar carboxy domain structures was discarded, leaving five structures, and for each of these the matching amino terminal domain was also retained, leaving: BLV-1, HIV-1, HML2, HTLV-1 and RSV. Despite having a smaller set of comparisons (5N + 5C domains giving 20 rather than 38 Z-scores), the results for this reduced set indicated an equally clear bias towards a preferred like-domain equivalence, especially as measured by their occurrence in the upper half of the ranked list, with several having a probability below the 0.05 level and a few below the 0.005 level (Fig. 10b).
Ortho and foamy capsid domain comparison T-test significance

Avg: 6.67e-01 < 1.32e+00        Avg: 6.51e-01 < 1.25e+00
Tprob = 4.62e-21 **             Tprob = 2.35e-16 **
StD: 1.61e-01 = 2.12e-01        StD: 1.17e-01 = 1.89e-01
Fprob = 1.84e-01                Fprob = 1.12e-01

Avg: 4.92e-01 < 1.29e+00        Avg: 6.22e-01 < 1.30e+00
Tprob = 4.09e-10 **             Tprob = 3.81e-23 **
StD: 1.02e-01 < 2.21e-01        StD: 1.12e-01 = 1.77e-01
Fprob = 7.37e-03 **             Fprob = 1.20e-01
From these results, it can be seen that all four possible pairings are highly significant, with probabilities ranging from $10^{-10}$ to less than $10^{-20}$. It is also clear that the two swapped pairings (NC and CN) have higher probabilities than the forward pairings (NN and CC). Combining the probabilities ($P$) as $\Delta P = \log_{10}(P_{NN} P_{CC}) - \log_{10}(P_{NC} P_{CN})$ gives a difference of magnitude 17.7 (42.7 − 25.0), which means that the swapped pairing is almost 18 orders of magnitude less likely than the forward pairing. Calculating the same statistic on the reduced 5N+5C domain data set gave a similar result, but with the difference reduced 1000-fold to 15 orders of magnitude.
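The arithmetic behind this combined statistic can be checked directly from the four Tprob values in the table. The short sketch below is illustrative only: the assignment of the values to forward (NN, CC) and swapped (NC, CN) pairings is an assumption, inferred from the totals of 42.7 and 25.0 quoted in the text, and the function name is hypothetical.

```python
import math

# Tprob values from the T-test table. Which values belong to the forward
# (NN, CC) and which to the swapped (NC, CN) pairings is an assumption,
# inferred from the totals 42.7 and 25.0 quoted in the text.
p_forward = [4.62e-21, 3.81e-23]
p_swapped = [2.35e-16, 4.09e-10]


def log10_product(probabilities):
    """Return log10 of the product, summing logs to avoid underflow."""
    return sum(math.log10(p) for p in probabilities)


log_forward = log10_product(p_forward)
log_swapped = log10_product(p_swapped)
delta_p = log_forward - log_swapped

print(f"forward: {log_forward:.2f}  swapped: {log_swapped:.2f}  delta: {delta_p:.2f}")
```

The difference is negative because the forward pairings have the smaller probabilities; its magnitude is the 17.7 orders of magnitude discussed above.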
The unexpected swapped pairing, which was indicated originally by the DALI results, now seems less likely. The preferred, and biologically more reasonable, result is that the ortho virus domains are related to the foamy virus domains as a result of genetic divergence from a common, double domain ancestor.
Such a relationship between the foamy domains implies an equivalent relationship in the ortho viruses, and a similar comparison of the structures of their N and C domains finds matches with Z-scores ranging from 2 to 4. As with the comparison of the ortho and foamy structures, these can be pooled to allow a joint T-test to be applied. This gave a probability of $10^{-8}$ that the true N/C domain comparisons were drawn from the decoy distribution, adding strong support to the hypothesis of an ancient gene duplication occurring before the split of the ortho and foamy virus families (Fig. 11a, b, blue triangles). Supporting this relationship, earlier studies also suggested an internal duplication in the ortho viruses but were based largely on very distant sequence similarity.
This test was applied only to the comparison of domains between viruses with known structures for both domains; however, it is not unreasonable to compare amino and carboxy domains across all viruses. The longer loops in the ortho virus domains give greater scope for structural variation, and a wide range was seen, with RMSD values from under 4 to over 12 Å. When normalised for length (a-value from Eq. 1) and with partial matches under 60 positions excluded, a distinct cluster remains between a = 0.5 and 0.8 (4 to 6 Å RMSD), but still with a long tail to higher values. Despite this tail, the T-test on the distributions is highly significant at $2.7\times10^{-17}$.
To summarise the structural relationships among the ortho and foamy domains, the matrix of pairwise comparisons was projected into a three-dimensional fold-space (see “Methods” for details). This produces a visual representation that best preserves the RMSD values between domains.
Discussion and conclusions
The comparison of small domains that are largely composed of α-helices presents a challenging problem in how to interpret the significance of the RMSD values. As the individual helical secondary structure elements (SSEs) constitute a sizeable fraction of the domain, it takes only the chance alignment of a few helices to result in a low RMSD over a large proportion of the structure, giving an apparently meaningful result.
The use of the customised decoy-model sets, as illustrated here, attempts to avoid this problem by recreating a large number of possible folds that were generated using the same (reconnected) SSEs. Moreover, to avoid any chance recreation of native fragments, each comparison always involved the comparison of a native (forward) chain direction with a reversed chain. Using these models, a background distribution of decoy/decoy comparisons allowed us to calculate Z-scores for each native/native comparison between the different Gag proteins. This has the advantage that every comparison in the background distribution involved two models with the same length, residue packing density and secondary composition as the native pair. These values indicated a clearly significant relationship between the foamy and ortho CA structures.
Direct or transposed domain order?
Although the decoy model alignment strategy did confirm the relationship between the foamy and ortho CA structures, the Z-scores did not point to a clear resolution of whether the domains should have a direct correspondence (NN and CC match) or a transposed relationship (NC and CN) as significant individual matches were found across all pairings. Testing for a bias towards more significant like-domain pairings (NN, CC) in the list of similarities ranked by Z-score confirmed the visual bias towards a natural correspondence but only at a marginal level of significance (around 0.05). By contrast, the application of a T-test on the combined raw comparison data returned a very clear distinction between the direct and the transposed relationships, clearly favouring the more natural forward order.
However, although the “astronomic” probabilities calculated by the T-test seem very convincing, they must be viewed in the light of the much weaker significance obtained from the asymmetry statistics. Both calculations involve assumptions and are limited by the small number of known structures, so neither can be taken as definitive. Nevertheless, it would seem likely that the “true” level of significance lies somewhere between the two results, and as both of these objective assessments point in the direction of the NN and CC domain order, there is no reason to adopt the more unexpected transposed domain order.
On the basis of these structural comparisons, and a variety of recently described functional assays, we can conclude that the central region of the spumavirus gag gene encodes a polypeptide sequence related to that of the corresponding region of orthoretroviral CA. It therefore seems reasonable to suppose that the last common ancestor of orthoretroviruses and spumaviruses possessed such a sequence. Moreover, this region appears to be made up of two related all-helical subdomains, suggesting a gene duplication event in a common precursor.
In our initial search employing foamy virus CA using the DALI program, we made the observation that the strongest similarity of the foamy virus CA domains was actually with a cellular protein, Arc (Activity-Regulated Cytoskeleton-associated protein). Arc is required for neural synaptic growth and activity [13–16] and mis-regulation and/or deletion contributes to diseases of cognition [14, 16, 17]. Arc has widespread and clear sequence homologues as far back as insects and probably deeper, giving it a very ancient origin somewhere close to the metazoan root [12, 18], and, based on sequence homology, Arc is considered to be a relic of an ancient Ty3/Gypsy retrotransposon, preserved as a ‘living fossil’ in metazoan genomes. Given the structural relatedness of foamy virus CA and Arc, this might suggest an equally ancient origin for foamy virus CA. As it is believed that the Ty3/Gypsy family of retrotransposons gave rise to retroviruses, it will therefore be of considerable interest to determine whether the Gag of Ty elements also comprises CA proteins with a two-domain structure.
It is also noteworthy that Ty3 Gag is significantly smaller than that of the foamy and orthoretroviruses and, although it contains CA related sequences, there is no equivalent of either orthoretroviral MA or PFV Gag-NtD, regions of Gag necessary for membrane targeting, budding and extracellular release of virions. Therefore, given the very different structures of MA [20–23] and Gag-NtD, this raises the possibility that the MA and Gag-NtD domains of the orthoretroviruses and foamy viruses were co-opted by independent events that have resulted in the viruses employing different mechanisms to facilitate budding from the cell. Notably, Gag from Gypsy, an Errantivirus capable of extracellular replication, and Arc contain additional N-terminal domains. In Gypsy-Gag this domain is distantly sequence-related to orthoretroviral MA. By contrast, in Arc it contains a coiled coil region reminiscent of spumavirus Gag-NtD [4, 25], further supporting the notion of a shared origin for Arc and foamy virus Gag that is distinguishable from an alternative acquisition pathway giving rise to Gypsy and the orthoretroviruses.
The foamy virus structures were obtained from the Protein Data Bank (PDB code: 5M1G).
BLV: bovine leukemia virus (deltaretrovirus) 4PH1 (N-ter.dom) and 4PH2 (C-ter.dom),
BLV6: bovine leukemia virus (hexameric) 4PH0 (both dom.s),
HIV6: human immunodeficiency virus 1 3H47 (both dom.s),
HML2: human endogenous retrovirus type-K (betaretrovirus),
HTLV: human T-cell leukemia virus (deltaretrovirus) 1QRJ (both dom.s),
JSRV: jaagsiekte sheep retrovirus (betaretrovirus) 2V4X (N-ter.dom),
MLV: murine leukemia virus (gammaretrovirus) 1U7K (N-ter.dom),
MPMV: Mason-Pfizer monkey virus (betaretrovirus) 2KGF (N-ter.dom),
PSIV: prosimian immunodeficiency virus (ancient lentivirus) 2XGV (N-ter.dom),
RELIK: rabbit endogenous lentivirus type-K (ancient lentivirus) 2XGU (N-ter.dom),
RSV: Rous sarcoma virus (alpharetrovirus) 3G1I (both dom.s).
The DALI method for searching the PDB with a structural query was accessed via the server at http://ekhidna.biocenter.helsinki.fi/dali_server. The DALI method reports the significance of each match with an estimated Z-score, which is the raw comparison score normalised by the combined length of the proteins. Z-scores down to a value of 2 are reported by the program.
The list of DALI hits (ranked by Z-score) was assessed by how many high-scoring capsid structures had been identified. These true/false (T/F) hits were defined simply by protein descriptions that contained the words “CAPSID”, “GAG” or “P24”. This may have misclassified a few (low scoring) hits to the matrix protein and missed some hits where the primary description refers to a cyclophilin structure solved in complex with the capsid.
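The keyword rule described above is simple enough to sketch in a few lines. The example below is an illustration, not the script used in this work: the function names and the sample descriptions are hypothetical, and case-insensitive substring matching is an assumption.

```python
# Keywords quoted in the text; everything else here is illustrative.
CAPSID_KEYWORDS = ("CAPSID", "GAG", "P24")


def is_capsid_hit(description):
    """Return True if the description contains any of the keywords.

    Matching is by case-insensitive substring (an assumption), so a
    description mentioning only cyclophilin would be classed as False.
    """
    text = description.upper()
    return any(keyword in text for keyword in CAPSID_KEYWORDS)


def count_true_hits(descriptions, top_n):
    """Count keyword-defined capsid hits among the top_n ranked descriptions."""
    return sum(is_capsid_hit(d) for d in descriptions[:top_n])


# Hypothetical descriptions, already ranked by Z-score.
ranked_hits = [
    "HIV-1 CAPSID PROTEIN",
    "GAG POLYPROTEIN",
    "CYCLOPHILIN A",
    "TRANSCRIPTION FACTOR",
    "CAPSID PROTEIN P24",
]
print(count_true_hits(ranked_hits, 3))
```

A substring rule of this kind is deliberately crude, which is why the misclassifications noted above (matrix protein hits counted as true, cyclophilin complexes missed) can occur.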
DALI reports structural hits in both the full PDB and a reduced collection of structures that have no pair of proteins with over 90% sequence identity, referred to as the 90% non-redundant or PDB-90 collection. It was found, however, that some hits seen in the full PDB were not found in the PDB-90; for example, in Fig. 6, all of the top 31 hits of the N-domain against the full PDB are missing from the PDB-90 hits. The most likely explanation is that the PDB-90 selection has not been updated at the same time as the full collection. For this reason, hits to both databases were monitored.
The SAP method for structure comparison was run as a local copy, which can be accessed at https://github.com/WillieTaylor/util. As part of determining the alignment between two structures, the SAP program calculates a similarity score for each pair of matched positions, which measures how similar the rest of the structure looks from the viewing-frame of the superposed residues. This value can be used both to weight the importance of positions when calculating the (rigid-body) RMSD superposition and to colour positions in the superposed structures (as in Fig. 3).
If the matched positions are ranked by this value, then RMSD values can be calculated over increasingly larger subsets to highlight the extent of a well matched core before the contribution of variable loops, or domain shifts, leads to higher RMSD values (as in Fig. 1b).
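As a minimal sketch of this cumulative calculation, the function below ranks matched positions by their similarity score and reports the RMSD over the best 1, 2, ..., n positions. It assumes a single fixed superposition, with the per-position deviations (in Å) already measured; whether SAP refits the superposition for each subset is not stated here, so that simplification is an assumption, as is the function name.

```python
import math


def cumulative_rmsd(scores, deviations):
    """Return [(n, rmsd)] over the n best-scoring matched positions.

    scores[i]     : similarity score of matched position i (higher is better)
    deviations[i] : distance in Angstroms between the superposed atoms at i
    The superposition is taken as fixed (an assumption of this sketch).
    """
    if len(scores) != len(deviations):
        raise ValueError("scores and deviations must have the same length")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    curve = []
    sum_sq = 0.0
    for n, i in enumerate(order, start=1):
        sum_sq += deviations[i] ** 2
        curve.append((n, math.sqrt(sum_sq / n)))
    return curve


# Toy data: the lowest-scoring position is a badly fitting loop residue.
curve = cumulative_rmsd([0.9, 0.1, 0.5], [1.0, 5.0, 2.0])
for n, rmsd in curve:
    print(n, round(rmsd, 3))
```

In the toy data the RMSD stays low over the two best-scoring positions and jumps when the poorly matched position is added, which is the behaviour used to delimit a well matched core.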
Decoy structure construction
Reversed structure decoys
Simple structural decoys were generated from native PDB structures by reversing the order of the α-carbon atoms in the PDB file using the Unix command line:
cat native.pdb | grep ' CA ' | sort -nr -k2 > reverse.pdb
The reversal of a protein chain does not alter the chirality of the alpha helix and these decoys can be used directly in SAP. However, DALI requires all main-chain atoms and these must be regenerated for the reversed decoys. This was done using the simple ca2main program, which can also be found at https://github.com/WillieTaylor/util. The method is based on the geometry of the α-carbon virtual chain, using relationships described previously.
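For readers who prefer not to rely on shell tools, the same reversal can be sketched in Python. This is an illustrative equivalent rather than the code used here: it selects α-carbon ATOM records by the fixed-column atom name field and reverses their file order, which matches the command above only when atom serial numbers increase along a single chain (an assumption). The helper `make_atom` is hypothetical and exists only to build demonstration records.

```python
def reverse_ca_trace(pdb_lines):
    """Keep only alpha-carbon ATOM records and return them in reverse order.

    The atom name occupies columns 13-16 of a PDB ATOM record, i.e. the
    Python slice [12:16].
    """
    ca_records = [
        line for line in pdb_lines
        if line.startswith("ATOM") and line[12:16].strip() == "CA"
    ]
    return ca_records[::-1]


def make_atom(serial, name, resnum):
    """Build a minimal fixed-column ATOM record (coordinates omitted)."""
    return f"ATOM  {serial:5d} {name:<4s} ALA A{resnum:4d}"


# Three residues, each with a backbone N and CA atom.
demo = [
    make_atom(1, " N", 1), make_atom(2, " CA", 1),
    make_atom(3, " N", 2), make_atom(4, " CA", 2),
    make_atom(5, " N", 3), make_atom(6, " CA", 3),
]
reversed_trace = reverse_ca_trace(demo)
print([int(line[22:26]) for line in reversed_trace])
```

As with the shell command, the records keep their original residue numbers, so a downstream program that depends on ascending numbering would need them renumbered.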
Customised structural decoys were generated for each comparison using each of the pair of structures being compared to create two pools of decoys then comparing all decoys in the first pool against all decoys from the second but with their chain reversed as described in the previous section.
The decoys were created as described previously: starting by cyclising the chain, then introducing new termini in each surface loop to create cyclic permutations. In addition, when three loop regions lie in close proximity, their ends are also reconnected in such a way that if a chain comprising four segments (1…4) runs from amino (N) to carboxy (C) termini through three adjacent loop regions a-b, c-d and e-f (i.e.: N,1,a-b,2,c-d,3,e-f,4,C), then the reconnected chain runs: N,1,a-d,3,e-b,2,c-f,4,C, with each switch being made at the least disruptive point between a pair of loops. This chain switching does not create any reversed segments, which would otherwise form regions of local matching when the whole chain is reversed.
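The sequence-order bookkeeping of these two operations can be illustrated on a list of residue indices. The sketch below deliberately ignores all three-dimensional geometry (loop proximity and the choice of the least disruptive switch point), so it shows only how the chain order changes; the function names and cut positions are hypothetical.

```python
def cyclic_permutation(chain, new_start):
    """Join the old termini and open the chain at index new_start."""
    return chain[new_start:] + chain[:new_start]


def reconnect_three_loops(chain, cut1, cut2, cut3):
    """Swap the two middle segments: 1,2,3,4 becomes 1,3,2,4.

    cut1, cut2 and cut3 are indices inside the three adjacent loops
    (a-b, c-d and e-f). Every segment keeps its forward direction, so no
    reversed segments are created.
    """
    if not 0 < cut1 < cut2 < cut3 < len(chain):
        raise ValueError("cut points must be ordered and inside the chain")
    seg1, seg2 = chain[:cut1], chain[cut1:cut2]
    seg3, seg4 = chain[cut2:cut3], chain[cut3:]
    return seg1 + seg3 + seg2 + seg4


residues = list(range(12))
print(cyclic_permutation(residues, 4))
print(reconnect_three_loops(residues, 3, 6, 9))
```

The second line of output runs 0-2, 6-8, 3-5, 9-11, which is the N,1,3,2,4,C order described above with each segment still read in its original direction.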
In a pair of structures, if each has four surface loops where breaks can be made, then including the native termini, this gives five cyclic permutations, and if two groups of loops can be reconnected then a total of 15 distinct decoys can be made from each native starting structure. As these can be compared pairwise, a pool of 225 decoy derived data points is generated that constitutes the random background against which the native/native comparison can be assessed.
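These counts follow from simple multiplication. The sketch below assumes that each loop reconnection is applied on its own, so that a structure with two reconnectable loop groups has three chain topologies (the original connection plus two reconnected variants). That reading reproduces the figure of 15 but is an interpretation rather than something stated explicitly, and the function name is hypothetical.

```python
def decoys_per_structure(n_surface_loops, n_reconnections):
    """Number of distinct decoys under the assumptions stated above."""
    # One cyclic permutation per surface loop, plus the native termini.
    cyclic_permutations = n_surface_loops + 1
    # Assumption: the original topology plus one variant per reconnection.
    topologies = 1 + n_reconnections
    return cyclic_permutations * topologies


per_structure = decoys_per_structure(n_surface_loops=4, n_reconnections=2)
background_pool = per_structure * per_structure  # all-against-all pairs

print(per_structure, background_pool)
```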
For example, in Fig. 8, the 36 data points marked by a solid circle come from the comparison of six cyclic permutations of a native ortho domain compared with six permutations of a reversed foamy domain that includes a single loop reconnection.
Every pair drawn from this pool will have the same lengths as the two native structures as well as the same secondary structure composition, surface exposure, residue packing density and inertial properties but each decoy will have a different chain fold.
RMSD length normalisation
The quality of structure comparisons can be characterised by a combination of their RMSD value and the number of matched (superposed) positions. How to combine these values has been the subject of much discussion over the years, and central to this is the expected random RMSD value for two proteins of a given length [39–41]. However, when reviewed, all these measures were approximations of a simple square-root function of the protein length (as originally proposed by McLachlan on theoretical grounds) but with an added term to depress the RMSD values obtained with small units or structures that are dominated by secondary structure elements (and super-secondary structure motifs), giving a lower than expected RMSD value. The formula that best captures this is $R = \sqrt{N}\,(1 - \exp(-N^2/s^2))$, where $R$ is the expected random RMSD for $N$ matched positions and $s$ is the damping factor in the inverted Gaussian term (equivalent to the standard deviation in the Normal distribution).
If this curve is multiplied by a scaling factor $a$, giving $R = a\sqrt{N}\,(1 - \exp(-N^2/s^2))$ (Eq. 1), then for a suitable value of $a$ the line will pass through the data point. This reduces the pair of values $(R,N)$ to a single value $a$ that is a simpler quantity for statistical analysis.
The best value for $s$ is slightly dependent on the nature of the proteins being compared. For artificial (random-walk) models with no secondary structure, no modification is needed, but the proteins considered here have segments of packed alpha helices that can be locally similar over two to three helices. To correct for this, a value of $s=30$ was used (or $1/s^2=0.11$), which is higher than the value of $1/s^2=0.03$ used previously. That this is a reasonable fit to the data can be seen in the way the dashed blue lines in Fig. 8 track the upper and lower boundary of the decoy comparison results.
When a=1, the point lies on the random line and when a=0, the RMSD is zero, so values of a that approach this lower bound will be of interest when evaluating similarity.
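A minimal sketch of the a-value calculation is given below. It assumes that the damping term multiplies the square root, i.e. that the random curve is sqrt(N) * (1 - exp(-N^2/s^2)) and that the a-value is the observed RMSD divided by that curve; the function names are hypothetical, and the damping factor is left as an explicit argument rather than given a default.

```python
import math


def expected_random_rmsd(n_matched, s):
    """Expected random RMSD for n_matched positions (assumed form of the curve)."""
    damping = 1.0 - math.exp(-(n_matched ** 2) / (s ** 2))
    return math.sqrt(n_matched) * damping


def a_value(rmsd, n_matched, s):
    """Scaling factor that makes the random curve pass through (rmsd, n_matched)."""
    return rmsd / expected_random_rmsd(n_matched, s)


# Illustrative numbers only: 4 Angstroms RMSD over 64 matched positions,
# using the damping factor s = 30 quoted in the text.
print(round(a_value(4.0, 64, 30.0), 3))
```

With these illustrative inputs the a-value is close to 0.5, consistent with the correspondence between a = 0.5…0.8 and 4 to 6 Å RMSD noted in the Results for matches of around 60 positions.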
The a-values obtained using Eq. 1 were plotted as frequency histograms using only data points that had a length of N±10, where N is the maximum number of matched positions in the comparison of the two native structures. As the sample size is small (typically, 100–300), these plots are quite noisy but their overall distribution does not deviate too greatly from a Gaussian distribution. This was tested on the difference between the observed and ideal cumulative distribution functions (CDFs) using the Kolmogorov-Smirnov test in the statistical package "R". Of the 38 samples from each domain pairing, the null hypothesis "that the sample was drawn from a Normal distribution" could be rejected in only two cases with a confidence below the 0.01 significance level, or three below the 0.05 level (see Additional file 1 for details). The underlying distribution becomes more apparent when the data sets are combined in Fig. 11d.
Previously, a cumulative plot of RMSD was used to select an optimal value for N (giving the minimum a-value). This can be important if the full set of matched positions is dominated by high deviations from variable loop regions. However, in the current application, the small length of the foamy virus loops meant that this was not an important aspect and the full number of matched positions was taken. Otherwise, the same correction would have to be applied to all decoy comparisons to maintain a fair comparison. (See Fig. 8, where the black dot marks the minimum a-value length).
The mean and standard deviation of the a-values in the N±10 region were calculated and the corresponding Normal distribution used to calculate Z-scores for the associated native comparison. (See Fig. 9 a, for an example).
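This step can be sketched as follows. The example assumes that the Z-score is signed so that a native a-value below the decoy mean (a better match than random) is positive, and that the sample standard deviation is used; both choices, and the function name, are assumptions made for illustration.

```python
import statistics


def native_z_score(native_a, native_n, decoys, window=10):
    """Z-score of a native a-value against decoys of similar matched length.

    decoys is an iterable of (a_value, n_matched) pairs; only those with
    n_matched within +/- window of native_n contribute to the background.
    """
    background = [a for a, n in decoys if abs(n - native_n) <= window]
    if len(background) < 2:
        raise ValueError("need at least two decoy comparisons in the window")
    mean = statistics.mean(background)
    std = statistics.stdev(background)
    # Lower a-values mean greater similarity, so the sign is flipped.
    return (mean - native_a) / std


# Toy decoy data: the last comparison falls outside the N +/- 10 window.
toy_decoys = [(1.0, 60), (1.2, 62), (1.4, 58), (0.2, 100)]
z = native_z_score(native_a=0.6, native_n=60, decoys=toy_decoys)
print(round(z, 2))
```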
Data from separate native/native comparisons, with their customised decoy data, were combined, giving not only a much larger background population of decoy derived scores but also a small population of native comparison scores that can be tested to calculate the probability that they were drawn from the same population as the decoy data. To do this, a T-test was used, which takes the size, mean, and standard deviation of each distribution and calculates a probability. The implementation of this test was taken from the Numerical Recipes collection, which implements one of two variants of the test depending on whether the distributions have statistically distinct standard deviations (routines ttest() and tutest()). The choice of routine is based on a preapplication of an F-test on the standard deviations (using the routine ftest()).
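The logic of this two-stage test can be sketched in standard-library Python. The code below is an independent illustration written in the style of those routines, not the code used in this work: it evaluates the regularised incomplete beta function by a continued fraction, applies a two-tailed F-test to the variances, and then uses either the pooled-variance or the unequal-variance form of the T-test. The 0.05 switching threshold and all function names are assumptions.

```python
import math
import statistics


def betacf(a, b, x, max_iter=200, eps=3e-14, tiny=1e-300):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        even = m * (b - m) * x / ((qam + m2) * (a + m2))
        odd = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        for aa in (even, odd):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def betai(a, b, x):
    """Regularised incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    bt = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                  + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * betacf(a, b, x) / a
    return 1.0 - bt * betacf(b, a, 1.0 - x) / b


def t_probability(t, df):
    """Two-tailed probability of a value at least as extreme as t."""
    return betai(0.5 * df, 0.5, df / (df + t * t))


def f_probability(var1, n1, var2, n2):
    """Two-tailed F-test probability that the two variances are equal."""
    if var1 > var2:
        f, df1, df2 = var1 / var2, n1 - 1, n2 - 1
    else:
        f, df1, df2 = var2 / var1, n2 - 1, n1 - 1
    prob = 2.0 * betai(0.5 * df2, 0.5 * df1, df2 / (df2 + df1 * f))
    return 2.0 - prob if prob > 1.0 else prob


def compare_samples(native, decoy, f_threshold=0.05):
    """Return (variant, t_prob, f_prob); f_threshold is an assumed cutoff."""
    n1, n2 = len(native), len(decoy)
    m1, m2 = statistics.mean(native), statistics.mean(decoy)
    v1, v2 = statistics.variance(native), statistics.variance(decoy)
    f_prob = f_probability(v1, n1, v2, n2)
    if f_prob < f_threshold:  # variances differ: unequal-variance form
        se2 = v1 / n1 + v2 / n2
        t = (m1 - m2) / math.sqrt(se2)
        df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
        return "unequal", t_probability(t, df), f_prob
    df = n1 + n2 - 2  # variances compatible: pooled-variance form
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / df
    t = (m1 - m2) / math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
    return "pooled", t_probability(t, df), f_prob


# Toy a-values: a few native comparisons against a decoy background.
variant, t_prob, f_prob = compare_samples([0.5, 0.6, 0.7], [1.1, 1.2, 1.3, 1.4, 1.5])
print(variant, f"{t_prob:.2e}", f"{f_prob:.2f}")
```

In the T-test table of the Results, the "=" and "<" signs between the standard deviations appear to record the outcome of this F-test stage, although the cutoff actually applied there is not stated.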
The values quoted in the Results section are for a two-tailed T-test. However, as it is expected that the native comparisons should always be more similar than comparisons between random models, a one-tailed T-test would be valid, which gives half the probability. As the values in the Tables are so significant and only the relative relationships are of interest, the choice is unimportant.
The results of the pairwise similarity within a set of structures can be visualised by treating the RMSD values as Euclidean distances (see footnote 5) and reducing their dimensionality to sufficiently few dimensions to be visualised: usually 2D or, better, 3D, to visualise the space with less distortion. Rather than use a simple multi-dimensional scaling (MDS) method, the more complicated method of multi-dimensional projection was used.
This method reduces the dimensionality of the projection in gradual stages with each step employing triangle-inequality balancing and hyper-dimensional real-space refinement. In the real-space refinement stages, a weight can be applied to pairwise distances. (This cannot be done in direct MDS projection, which can only assign a mass to each point). Weights were assigned to distances as a function of their inverse RMSD, up to a maximum value of 1.
1 This class is also commonly referred to as the Foamy viruses (after the morphological effect they have on infected cells) and will be referred to by this name frequently below, with the term orthoretroviruses also contracted to “Ortho viruses”.
2 http://ekhidna.biocenter.helsinki.fi/dali_server, see “Methods” section for details.
3 True/false hits were defined by protein descriptions with the words “CAPSID”, “GAG” or “P24”.
4 Note that reversing the α-carbon backbone does not change the chirality of the α-helices but as DALI requires a full atomic backbone, this must be restored on the reversed chain.
5 In theory, pairwise RMSD values are guaranteed to constitute a consistent Euclidean metric, but only in N-1 dimensions (where N is the number of structures compared).
The work was supported by the Francis Crick Institute under awards: FC001179 (WRT), FC001162 (JPS) and FC001178 (IAT). The Crick receives its core funding from Cancer Research UK, the UK Medical Research Council, and the Wellcome Trust.
Availability of data and materials
All data and source code used in this work can be found on GitHub at https://github.com/WillieTaylor in the repositories sapit (decoy generation code) and FoamyCapsid (code and data specific to the current application).
WRT, JPS and IAT conceived the work and evaluated the results. WRT executed the computational work. All authors contributed to the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Lindemann D, Rethwilm A. Foamy virus biology and its application for vector development. Viruses. 2011; 3:561–85.
- Müllers E. The foamy virus Gag proteins: what makes them different? Viruses. 2013; 5:1023–1041.
- Flügel RM, Pfrepper KI. Proteolytic processing of foamy virus Gag and Pol proteins. Curr Top Microbiol Immunol. 2003; 277:63–88.
- Goldstone DC, Flower TG, Ball NJ, Sanz-Ramos M, Yap MW, Ogrodowicz RW, Stanke N, Reh J, Lindemann D, Stoye JP, Taylor IA. A unique spumavirus Gag N-terminal domain with functional properties of orthoretroviral matrix and capsid. PLoS Pathog. 2013; 9:1003376.
- Ball NJ, Nicastro G, Dutta M, Pollard D, Goldstone DC, Sanz-Ramos M, Ramos A, Müllers E, Stirnnagel K, Stanke N, Lindemann D, Stoye JP, Taylor WR, Rosenthal PB, Taylor IA. Structure of a spumaretrovirus gag central domain reveals an ancient retroviral capsid. PLoS Pathog. 2016; 12:1005981. doi:10.1371/journal.ppat.1005981.
- Taylor WR. Protein structure alignment using iterated double dynamic programming. Prot Sci. 1999; 8:654–65.
- Holm L, Sander C. Protein-structure comparison by alignment of distance matrices. J Molec Biol. 1993; 233:123–38.
- Zhang W, Wu J, Ward MD, Yang S, Chuang YA, Xiao M, Li R, Leahy DJ, Worley PF. Structural basis of arc binding to synaptic proteins: implications for cognitive disease. Neuron. 2015; 86:490–500.
- Taylor WR. Protein structure domain identification. Prot Engng. 1999; 12:203–16.
- Taylor WR. Decoy models for protein structure score normalisation. J Molec Biol. 2006; 357:676–99.
- Levitt M, Gerstein M. A unified statistical framework for sequence comparison and structure comparison. Proc Natl Acad Sci USA. 1998; 95:5913–920.
- Campillos M, Doerks T, Shah PK, Bork P. Computational characterization of multiple gag-like human proteins. Trends in Genetics. 2006; 22:285–589.
- Chowdhury S, Shepherd JD, Okuno H, Lyford G, Petralia RS, Plath N, Kuhl D, Huganir RL, Worley PF. Arc/Arg3.1 interacts with the endocytic machinery to regulate AMPA receptor trafficking. Neuron. 2006; 52:445–59.
- Park S, Park JM, Kim S, Kim JA, Shepherd JD, Smith-Hicks CL, Chowdhury S, Kaufmann W, Kuhl D, Ryazanov AG, et al. Elongation factor 2 and fragile X mental retardation protein control the dynamic translation of Arc/Arg3.1 essential for mGluR-LTD. Neuron. 2008; 59:70–83.
- Shepherd JD, Rumbaugh G, Wu J, Chowdhury S, Plath N, Kuhl D, Huganir RL, Worley PF. Arc/arg3.1 mediates homeostatic synaptic scaling of AMPA receptors. Neuron. 2006; 52:475–484.
- Waung MW, Pfeiffer BE, Nosyreva ED, Ronesi JA, Huber KM. Rapid translation of Arc/Arg3.1 selectively mediates mGluR-dependent LTD through persistent increases in AMPAR endocytosis rate. Neuron. 2008; 59:84–97.
- Niere F, Wilkerson JR, Huber KM. Evidence for a fragile x mental retardation protein-mediated translational switch in metabotropic glutamate receptor-triggered arc translation and long-term depression. J Neurosci. 2012; 32:5924–936.
- Volff JN. Cellular genes derived from Gypsy/Ty3 retrotransposons in mammalian genomes. Annals New York Acad Sci. 2009; 1178:233–43.
- Llorens C, Fares MA, Moya A. Relationships of gag-pol diversity between Ty3/Gypsy and retroviridae LTR retroelements and the three kings hypothesis. BMC Evol Biol. 2008; 8:276.
- Hill CP, Worthylake D, Bancroft DP, Christensen AM, Sundquist WI. Crystal structures of the trimeric human immunodeficiency virus type 1 matrix protein: implications for membrane association and assembly. Proc Natl Acad Sci USA. 1996; 93:3099–104.
- Prchal J, Srb P, Hunter E, Ruml T, Hrabal R. The structure of myristoylated mason-pfizer monkey virus matrix protein and the role of phosphatidylinositol-(4,5)-bisphosphate in its membrane binding. J Mol Biol. 2012; 423:427–38.
- Rao Z, Belyaev AS, Fry E, Roy P, Jones IM, Stuart DI. Crystal structure of SIV matrix antigen and implications for virus assembly. Nature. 1995; 378:743–7.
- Riffel N, Harlos K, Iourin O, Rao Z, Kingsman A, Stuart DI, Fry E. Atomic resolution structure of moloney murine leukemia virus matrix protein and its relationship to other retroviral matrix proteins. Structure. 2002; 10:1627–1636.
- Song SU, Gerasimova T, Kurkulos M, Boeke JD, Corces VG. An env-like protein encoded by a drosophila retroelement: evidence that gypsy is an infectious retrovirus. Gene Dev. 1994; 8:2046–2057.
- Tobaly-Tapiero J, Bittoun P, Giron ML, Neves M, Koken M, Saib A, de The H. Human foamy virus capsid formation requires an interaction domain in the N-terminus of Gag. J Virol. 2001; 75:4367–4375.
- Obal G, Trajtenberg F, Carrion F, Tome L, Larrieux N, Zhang X, Pritsch O, Buschiazzo A. Conformational plasticity of a native retroviral capsid revealed by X-ray crystallography. Science. 2015; 349:95–8. doi:10.1126/science.aaa5182.
- Gamble TR, Vajdos FF, Yoo S, Worthylake DK, Houseweart M, Sundquist WI, Hill CP. Crystal structure of human cyclophilin A bound to the amino-terminal domain of HIV-1 capsid. Cell; 87:1285–1294.
- Worthylake DK, Wang H, Yoo S, Sundquist WI, Hill CP. Structures of the HIV-1 capsid protein dimerization domain at 2.6 Å resolution. Acta Crystallogr., Sect. D. 1999; 55:85–92. doi:10.1107/S0907444998007689.
- Pornillos O, Ganser-Pornillos BK, Kelly BN, Hua Y, Whitby FG, Stout CD, Sundquist WI, Hill CP, Yeager M. X-ray structures of the hexameric building block of the HIV capsid. Cell. 2009; 137:1282–1292. doi:10.1016/j.cell.2009.04.063.
- Mortuza GB, Dodding MP, Goldstone DC, Haire LF, Stoye JP, Taylor IA. Structure of B-tropic MLV capsid N-terminal domain. J Mol Biol; 376:1493–1508.
- Khorasanizadeh S, Campos-Olivas R, Clark CA, Summers MF. Sequence-specific 1H, 13C and 15N chemical shift assignment and secondary structure of the HTLV-I capsid protein. J Biomol NMR. 1999; 14:199–200.
- Mortuza GB, Goldstone DC, Pashley C, Haire LF, Palmarini M, Taylor WR, Stoye JP, Taylor IA. Structure of the capsid amino terminal domain from the betaretrovirus, Jaagsiekte sheep retrovirus. J Molec Biol. 2009; 386:1179–1192.
- Mortuza GB, Haire LF, Stevens A, Smerdon SJ, Stoye JP, Taylor IA. High-resolution structure of a retroviral capsid hexameric amino-terminal domain. Nature. 2004; 431:481–5.
- Macek P, Chmelik J, Krizova I, Kaderavek P, Padrta P, Zidek L, Wildova M, Hadravova R, Chaloupkova R, Pichova I, Ruml T, Rumlova M, Sklenar V. NMR structure of the N-terminal domain of capsid protein from the mason-pfizer monkey virus. J Mol Biol. 2009; 392:100–14. doi:10.1016/j.jmb.2009.06.029.
- Goldstone DC, Yap MW, Robertson LE, Haire LF, Taylor WR, Katzourakis A, Stoye JP, Taylor IA. Structural and functional analysis of prehistoric lentiviruses uncovers an ancient molecular interface. Cell Host Microbe. 2010; 8:248–59.
- Bailey GD, Hyun JK, Mitra AK, Kingston RL. Proton-linked dimerization of a retroviral capsid protein initiates capsid assembly. Structure. 2009; 17:737–48. doi:10.1016/j.str.2009.03.010.
- Rippmann F, Taylor WR. Visualization of structural similarity in proteins. J Molec Graph. 1991; 9:3–16.
- Levitt M, Greer J. Automatic identification of secondary structure in globular proteins. J Molec Biol. 1977; 114:181–293.
- McLachlan AD. How alike are the shapes of two random chains? Biopolymers. 1984; 23:1325–1331.
- Cohen FE, Sternberg MJE. On the prediction of protein structure: the significance of the root-mean-square deviation. J Molec Biol. 1980; 138:321–33.
- Maiorov VN, Crippen GM. Significance of root-mean-square deviation in comparing three-dimensional structures of globular proteins. J Mol Biol. 1994; 235:625–34.
- Press WH, Flannery BP, Teukolsky SA, Vetterling WT. Numerical Recipes: The Art of Scientific Computing. Cambridge: Cambridge Univ. Press; 1986.
- Brown NP, Orengo CA, Taylor WR. A protein structure comparison methodology. Computers Chem. 1996; 20:359–80.
- Aszódi A, Taylor WR. Hierarchical inertial projection: a fast distance matrix embedding algorithm. Computers Chem. 1997; 21:13–23.
- Taylor WR, May ACW, Brown NP, Aszódi A. Protein structure: Geometry, topology and classification. Rep Prog Phys. 2001; 64:517–90.
- Taylor WR, Chelliah V, Hollup SM, MacDonald JT, Jonassen I. Probing the “dark matter” of protein fold-space. Structure. 2009; 17:1244–1252.
- Aszódi A, Taylor WR. Folding polypeptide α-carbon backbones by distance geometry methods. Biopolymers. 1994; 34:489–506.