LRRML: a conformational database and an XML description of leucine-rich repeats (LRRs)
BMC Structural Biology volume 8, Article number: 47 (2008)
Leucine-rich repeats (LRRs) are present in more than 6000 proteins. They are found in organisms ranging from viruses to eukaryotes and play an important role in protein-ligand interactions. To date, more than one hundred crystal structures of LRR containing proteins have been determined. This knowledge has increased our ability to use the crystal structures as templates to model LRR proteins with unknown structures. Since the individual three-dimensional LRR structures are not directly available from the established databases and since there are only a few detailed annotations for them, a conformational LRR database useful for homology modeling of LRR proteins is desirable.
We developed LRRML, a conformational database and an extensible markup language (XML) description of LRRs. The release 0.2 contains 1261 individual LRR structures, which were identified from 112 PDB structures and annotated manually. An XML structure was defined to exchange and store the LRRs. LRRML provides a source for homology modeling and structural analysis of LRR proteins. In order to demonstrate the capabilities of the database we modeled the mouse Toll-like receptor 3 (TLR3) by multiple templates homology modeling and compared the result with the crystal structure.
LRRML is an information source for investigators involved in both theoretical and applied research on LRR proteins. It is available at http://zeus.krist.geo.uni-muenchen.de/~lrrml.
Leucine-rich repeats (LRRs) are arrays of 20 to 30 amino acid long protein segments that are unusually rich in the hydrophobic amino acid leucine. They are present in more than 6000 proteins in different organisms ranging from viruses to eukaryotes. The structure of the LRRs and their arrangement in repetitive stretches of variable length generate a versatile and highly evolvable framework for the binding of manifold proteins and non-protein ligands. The crystal structure of the ribonuclease inhibitor (RI) yielded the first insight into the three-dimensional molecular basis of LRRs. It has a horseshoe-shaped solenoid structure with a parallel β-sheet lining the inner circumference and α-helices flanking its outer circumference. To date, there are over one hundred crystal structures available. All known LRR domains adopt an arc or horseshoe shape.
The LRR sequences can be divided into a highly conserved segment (HCS) and a variable segment (VS). The highly conserved segment consists of an 11 or 12 residue stretch with the consensus sequence LxxLxLxxN(Cx)xL. Here, the letter L stands for Leu, Ile, Val or Phe forming the hydrophobic core, N stands for Asn, Thr, Ser or Cys, and x is any amino acid. The variable segment is quite diverse in length and consensus sequence; accordingly, eight classes of LRRs have been proposed [4, 5]: 'RI-like (RI)', 'Cysteine-containing (CC)', 'Bacterial (S)', 'SDS22-like (SDS22)', 'Plant-specific (PS)', 'Typical (T)', 'Treponema pallidum (Tp)' and 'CD42b-like (CD42b)'.
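As a rough illustration, the HCS consensus can be expressed as a regular expression. The sketch below is not part of LRRML. It assumes that 'N(Cx)' means the N-class residue is followed by one or two arbitrary residues, which reproduces the stated length of 11 or 12 residues. The function name find_hcs and the example sequence are hypothetical.

```python
import re

# Assumed reading of the consensus LxxLxLxxN(Cx)xL:
#   L -> Leu, Ile, Val or Phe;  N -> Asn, Thr, Ser or Cys;  x -> any residue.
#   One or two residues between N and the final L give 11 or 12 residues.
HCS_PATTERN = re.compile(r"[LIVF]..[LIVF].[LIVF]..[NTSC].{1,2}[LIVF]")


def find_hcs(sequence):
    """Return (0-based start, matched segment) for each non-overlapping HCS candidate."""
    return [(m.start(), m.group()) for m in HCS_PATTERN.finditer(sequence.upper())]


if __name__ == "__main__":
    # Hypothetical test sequence: three flanking residues around one 11-residue HCS.
    for start, segment in find_hcs("AAALKSLELDGNQLGGG"):
        print(start, segment)
```

A match of this kind is only a candidate. Mutations and insertions make many real LRRs deviate from the consensus, so a pattern search cannot replace inspection of the three-dimensional structure.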
The discrepancy between the numbers of structure-known LRR proteins and the structure-unknown ones triggered studies focusing on the homology modeling of LRR proteins [6–8]. Homology modeling is a computational method, which is widely used to identify structural features defining molecular interactions [8–10]. The modeling results are an important input for the design of biochemical experiments. The first step of homology modeling is the selection of a structure-known protein, which serves as a template for the unknown target structure. In practice, however, it is difficult to find a complete template which has a high enough sequence identity to the target repetitive protein (single template modeling), due to different repeat numbers and varying arrangements. This limitation can be overcome by combining multiple templates. First, the most similar structure-known LRRs are found for each LRR in the target sequence as a local template. Second, all local templates are combined to generate the multiple sequence alignments for the entire target sequence. Thus, it is possible to construct a start model for further investigation, even if no adequate single template is available. Such an approach, however, requires a comprehensive database of LRRs to extract adequate template candidates. So far, the individual three-dimensional LRR structures are not directly available from the established databases and there are only a few detailed annotations for them. Additional information such as sequence insertions and types is missing. In order to consolidate this information and to provide a source for homology modeling and structural analysis of LRR proteins, we developed LRRML, a database and an extensible markup language (XML) description of LRR structures.
Construction and content
Structure-known LRR proteins were extracted from the Protein Data Bank (PDB) release Sept 10, 2008. In order to ensure that all LRR proteins were found, we combined three groups of search results. First, 'leucine rich repeat', 'leucine rich repeats', 'leucine-rich repeat', 'leucine-rich repeats', 'lrr' and 'lrrs' were used as keywords in the PDB quick search; second, 'SCOP classification -> Alpha and beta proteins (a/b) -> Leucine-rich repeat' was used as options in PDB advanced search; third, 'CATH classification -> Alpha Beta -> Alpha-Beta Horseshoe -> Leucine-rich repeat' was used as options in PDB advanced search. Because of the irregularity of LRRs (mutations and insertions in the sequence), reliable identification of the LRRs contained in the LRR proteins could only be performed manually. We inspected the three-dimensional structures of the LRR proteins using molecular viewers and identified each LRR based on two criteria:
An LRR begins at the beginning of the highly conserved segment (HCS) and ends at the end of the variable segment (VS) (just before the HCS of the next LRR).
The HCS of an LRR must adopt a typical conformation, i.e. a short β-sheet begins at about position 3 and a hydrophobic core is formed by the four L residues at positions 1, 4, 6 and 11.
The LRRs were then manually classified according to the consensus sequences [4, 5]. In addition to the eight canonical LRR classes listed in the background section we included a new class 'other' for the N-/C-terminal LRRs and some hyper-irregular LRRs. Table 1 illustrates the consensus sequences of the eight canonical LRR classes.
During the LRR identification and classification all sequence insertions longer than 3 residues were annotated. About one tenth of entries have insertions longer than 3 residues while few entries have deletions, which suggests that the evolution of LRRs may prefer insertion to deletion.
The LRRML release 0.2 contains 1261 LRR entries from 112 PDB structures. Among them, 548 LRRs are distinct at the sequence level, indicating that different molecules can share identical LRRs. By superimposition, we found that LRRs with identical sequences also have highly similar structures. This fact enhances the confidence in modeling LRR proteins using multiple LRR templates. A histogram of the entry length distribution (Figure 1) shows that the LRR lengths are concentrated in the interval from 20 to 29 residues, which covers the characteristic lengths of the consensus sequences of the eight canonical LRR classes. Some entries are longer than 30 residues because they contain large insertions. Table 2 presents the distribution of LRR entries and PDB entries over the nine classes. The classification results are consistent with a previous report which showed that LRRs from different classes never occur simultaneously in the same protein and have most probably evolved independently. Exceptions to this rule are the T and S types, which often exist in the same protein forming the super motif 'STT'. It is assumed that both evolved from a common precursor.
Currently, there are several protein databases containing information on LRRs, such as Pfam, InterPro, SMART and Swiss-Prot. These databases predict the LRR numbers and boundaries for their LRR protein entries by various computational methods, regardless of whether the entries have known three-dimensional structures, so false negatives occur frequently. Table 3 lists the numbers of structure-known LRR proteins and their LRRs covered by these databases. As more detailed examples, the LRR numbers of LRR proteins from different classes reported by the established databases are compared in Table 4. Additionally, the individual three-dimensional LRR structures are not directly available from these databases. In order to combine the information required for homology modeling and structural analysis, LRRML is provided with three prominent characteristics:
Each database entry is an individual three-dimensional LRR structure, which was identified with high accuracy.
Extensive annotations, such as systematic classification, secondary structures, HCS/VS partitions and sequence insertion, are provided.
LRRs were extracted from all structure-known LRR protein structures from PDB.
The extensible markup language (XML) was standardized in the 1990s and is well established as a format for hierarchical data. It can be queried and parsed easily by application programs. Therefore, more and more biological databases use XML as their data storage format, managed by an XML database management system (DBMS) [17–19]. LRRML was designed by using eXist, an XML DBMS, and using XPath/XQuery for processing queries and web forms. We developed an LRR markup language (LRRML) for exchanging and storing LRR structures. It consists of four blocks of information:
The sequence information (XML tag <l:Sequence>): amino acid sequence and sequence length.
The classification information (XML tag <l:Type>): class name and consensus sequences.
The sequence partitions (XML tag <l:Regions>): amino acid sequence, position, length and insertion of HCS and VS.
The corresponding PDB sources (XML tag <l:Sources>): ID, chain, LRR number and classification of the source PDB entries; serial number, position, DSSP secondary structure and three-dimensional coordinates of the current LRR in these source PDB entries.
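To make the four blocks more concrete, the following sketch parses a mock entry with Python's standard library. Only the four block tags (<l:Sequence>, <l:Type>, <l:Regions>, <l:Sources>) and the element names LRR and TAbbr appear in this article. The namespace URI, the child elements HCS, VS and Source, all attribute names, the sequence and the PDB ID are hypothetical placeholders, so the real schema may well differ.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (assumption); the real LRRML namespace is not given here.
NS = {"l": "urn:example:lrrml"}

# Mock entry. Child elements and attributes below the four block tags are invented.
SAMPLE_ENTRY = """\
<l:LRR xmlns:l="urn:example:lrrml">
  <l:Sequence length="24">LKSLELDGNQLTSLPAGAFDGLTS</l:Sequence>
  <l:Type><l:TAbbr>T</l:TAbbr></l:Type>
  <l:Regions>
    <l:HCS>LKSLELDGNQL</l:HCS>
    <l:VS>TSLPAGAFDGLTS</l:VS>
  </l:Regions>
  <l:Sources>
    <l:Source pdb="0XXX" chain="A"/>
  </l:Sources>
</l:LRR>
"""


def summarize_entry(xml_text):
    """Collect the four information blocks of one mock entry into a dictionary."""
    root = ET.fromstring(xml_text)
    seq = root.find("l:Sequence", NS)
    return {
        "sequence": seq.text.strip(),
        "length": int(seq.get("length")),
        "class": root.findtext("l:Type/l:TAbbr", namespaces=NS),
        "hcs": root.findtext("l:Regions/l:HCS", namespaces=NS),
        "vs": root.findtext("l:Regions/l:VS", namespaces=NS),
        "sources": [(s.get("pdb"), s.get("chain"))
                    for s in root.findall("l:Sources/l:Source", NS)],
    }


if __name__ == "__main__":
    print(summarize_entry(SAMPLE_ENTRY))
```

In this mock layout the HCS and VS partitions concatenate to the full sequence, mirroring the rule that an LRR runs from the start of its HCS to the end of its VS.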
The entire database can be browsed by LRR IDs or by PDB IDs. When browsing, the entries first appear in a summary table containing ID, type and sequence. Clicking on an ID opens an HTML web page, converted by an XML stylesheet (XSLT), that presents the entry in detail. The original XML file and the coordinates file in PDB format can also be downloaded. The XSLT file used is provided as Additional file 2. Aside from the textual view, an LRR structure can be visualized by the online molecular viewer Jmol. After loading, users can change the view settings themselves. LRRML is provided with various search functions, including a PDB ID search, which returns all LRRs contained in the given PDB structure, a class search, which returns all LRRs of the given class, and a length search, which returns all LRRs with the given sequence length. To simplify homology modeling, a similarity search was implemented. It returns the structures of the most similar LRRs for a structure-unknown LRR. The target LRR sequence can be searched against the entire database, a certain LRR class or LRRs with a certain length. First, a global pairwise sequence alignment with its sequence identity is generated for the target LRR and each of the LRRs in the user-selected set. Then, the most similar LRRs are returned as template candidates, ranked by sequence identity.
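The two steps of the similarity search (global pairwise alignment, then ranking by sequence identity) can be sketched as follows. This is a minimal Needleman-Wunsch implementation and not the code used by LRRML. The scoring scheme (match +1, mismatch -1, gap -1), the definition of identity as identical columns divided by alignment length, and all function names are assumptions made for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner, preferring diagonal moves.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def sequence_identity(a, b):
    """Fraction of alignment columns that hold identical residues (assumed definition)."""
    aln_a, aln_b = global_align(a, b)
    if not aln_a:
        return 0.0
    same = sum(1 for x, y in zip(aln_a, aln_b) if x == y and x != "-")
    return same / len(aln_a)


def rank_templates(target, library, top=5):
    """Rank library entries (id -> sequence) by identity to the target, best first."""
    scored = [(lrr_id, sequence_identity(target, seq)) for lrr_id, seq in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]


if __name__ == "__main__":
    # Hypothetical target and template sequences.
    library = {"a": "LKSLELDGNQL", "b": "LKSLEVDGNQL", "c": "GGGGGGGGGGG"}
    for lrr_id, identity in rank_templates("LKSLELDGNQL", library):
        print(lrr_id, round(identity, 3))
```

Restricting `library` to one LRR class or one sequence length before ranking corresponds to the class-limited and length-limited searches described above.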
The DBMS provides a REST-style application programming interface (API) over HTTP, which supports GET and POST requests. A uniform resource identifier (URI) of the form 'http://zeus.krist.geo.uni-muenchen.de:8081/exist/rest/...' is treated by the server as the path to a database collection. Request parameters can be used to select the required elements. For example, '_query' executes a specified XPath/XQuery expression; the URL "http://zeus.krist.geo.uni-muenchen.de:8081/exist/rest/db/lrrml?_query=//LRR [.//TAbbr='S']" returns all the S type LRRs.
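A client would normally percent-encode the XPath expression before sending it. The sketch below only builds and decodes such a URL using the collection path and the '_query' parameter quoted above; it sends no request, and the server may no longer be reachable. The helper names are hypothetical, and the class abbreviation is inserted into the XPath without escaping, which is acceptable only for trusted input.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Collection path as quoted in the text.
BASE = "http://zeus.krist.geo.uni-muenchen.de:8081/exist/rest/db/lrrml"


def build_query_url(xpath, base=BASE):
    """Return a GET URL whose '_query' parameter carries the encoded XPath/XQuery."""
    return base + "?" + urlencode({"_query": xpath})


def lrrs_of_class(abbr):
    """URL selecting all LRR elements of one class (abbr is not escaped: sketch only)."""
    return build_query_url(f"//LRR[.//TAbbr='{abbr}']")


def extract_query(url):
    """Decode the '_query' parameter again, e.g. to check the round trip."""
    return parse_qs(urlsplit(url).query)["_query"][0]


if __name__ == "__main__":
    url = lrrs_of_class("S")
    print(url)
    print(extract_query(url))
```

The resulting URL could be passed to any HTTP client, for example `urllib.request.urlopen`, to retrieve the XML response.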
Application in homology modeling
LRRML was designed as a tool for template selection in homology modeling of LRR proteins. Traditionally, the template used in homology modeling is one or more full length protein structures obtained via similarity search. Nevertheless, due to the different repeat numbers and arrangements of LRRs, the sequence identity between the target and the full length template is usually not high enough for homology modeling. With LRRML the most similar structure-known LRR can be found for each LRR in the target sequence as a local template. The combination of all local templates through multiple alignments helps to achieve a high sequence identity to the target.
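The combination of local templates can be pictured as a block-structured alignment: the target occupies one row, and each template occupies its own row, filled with gaps outside the LRR it covers. The sketch below assumes that each target LRR has already been aligned to its template, so both strings of a pair have equal length. The function name and the toy sequences are hypothetical, and the output is a plain list of rows rather than a MODELLER input file.

```python
def combine_local_templates(aligned_pairs):
    """Build alignment rows from per-LRR (target, template) aligned string pairs.

    Returns [target_row, template_row_1, ..., template_row_N]; every row has the
    same length, and each template row is gap-filled outside its own LRR.
    """
    for target_part, template_part in aligned_pairs:
        if len(target_part) != len(template_part):
            raise ValueError("each aligned pair must have equal length")

    total = sum(len(target_part) for target_part, _ in aligned_pairs)
    rows = ["".join(target_part for target_part, _ in aligned_pairs)]

    offset = 0
    for target_part, template_part in aligned_pairs:
        left = "-" * offset
        right = "-" * (total - offset - len(template_part))
        rows.append(left + template_part + right)
        offset += len(target_part)
    return rows


if __name__ == "__main__":
    # Two toy "LRRs" of four residues each; real LRRs are 20 to 30 residues long.
    for row in combine_local_templates([("LKSL", "LRSL"), ("ELDG", "EL-G")]):
        print(row)
```

For a target with N LRRs this layout yields N + 1 rows, one for the target and one per local template.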
As a test case, we modeled the structure of the mouse Toll-like receptor 3 (TLR3) ectodomain. We treated the structure of the mouse TLR3 ectodomain as unknown and excluded the LRRs of the mouse and human TLR3 ectodomains from LRRML. Through similarity search, the optimal template for each of the 25 LRRs in mouse TLR3 was found. The sequence identity between each LRR pair (target/template LRR) is listed in Table 5. Then a 26-line multiple alignment was generated from the 25 template sequences and the target sequence as the input of MODELLER 9v3. The resulting three-dimensional model (Figure 3A) was evaluated by PROCHECK, with 98.2% of residues falling into the most favored or allowed regions of the main chain torsion angle distribution, whereas the result for the TLR3 crystal structure (PDB code: 3CIG) was 98.6% (Figure 4). Mouse TLR3 has been shown to bind double-stranded RNA ligand with both N-terminal and C-terminal sites on the lateral side of the convex surface of TLR3. The N-terminal interaction site is composed of LRRNT and LRR1-3, and the C-terminal site is composed of LRR19-21. We superimposed the resulting model onto the crystal structure of the mouse TLR3 ectodomain at the two interaction sites by using SuperPose v1.0. The root mean square deviations of the structures are 1.96 Å and 1.9 Å respectively (Figure 3B/C), indicating that the predicted model matched the crystal structure sufficiently well and was useful for the prediction of ligand interaction sites. These results demonstrate that homology modeling using combined multiple templates obtained from LRRML can create valuable information to trigger further biochemical research. Structural details, however, should be interpreted with due care.
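For reference, the root mean square deviation (RMSD) reported above is the square root of the mean squared distance between paired atoms. The sketch below computes only this final quantity and assumes the two coordinate sets are already superimposed and listed in the same atom order; the superposition step performed by SuperPose is not reproduced. The function name and the example coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD in the input units between two equally long lists of (x, y, z) tuples.

    Assumes the structures are already superimposed and atoms are paired by index.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


if __name__ == "__main__":
    model = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
    crystal = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    print(rmsd(model, crystal))
```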
A specialised conformational leucine-rich repeat database called LRRML has been developed. It is supported by an XML database management system and can be searched and browsed through either an easy-to-use web interface or a REST-like interface. The web interface is suitable for most graphical web browsers and has been tested on the Windows, Mac and Linux operating systems. LRRML contains individual three-dimensional LRR structures with manual structural annotations. It provides a useful source for homology modeling and structural analysis of LRR proteins. Since the number of structure-determined LRR proteins is constantly increasing, we plan to update LRRML every 2 to 3 months.
Availability and requirements
This database is freely available at http://zeus.krist.geo.uni-muenchen.de/~lrrml.
Matsushima N, Tanaka T, Enkhbayar P, Mikami T, Taga M, Yamada K, Kuroki Y: Comparative sequence analysis of leucine-rich repeats (LRRs) within vertebrate toll-like receptors. BMC Genomics 2007, 8: 124–143.
Dolan J, Walshe K, Alsbury S, Hokamp K, O'Keeffe S, Okafuji T, Miller SFC, Tear G, Mitchell KJ: The extracellular Leucine-Rich Repeat superfamily; a comparative survey and analysis of evolutionary relationships and expression patterns. BMC Genomics 2007, 8: 320–343.
Kobe B, Deisenhofer J: Crystal structure of porcine ribonuclease inhibitor, a protein with leucine-rich repeats. Nature 1993, 366: 751–756.
Kobe B, Kajava AV: The leucine-rich repeat as a protein recognition motif. Curr Opin Struct Biol 2001, 11: 725–732.
Bell JK, Mullen GE, Leifer CA, Mazzoni A, Davies DR, Segal DM: Leucine-rich repeats and pathogen recognition in Toll-like receptors. Trends Immunol 2003, 24: 528–533.
Kajava AV: Structural Diversity of Leucine-rich Repeat Proteins. J Mol Biol 1998, 277: 519–527.
Stumpp MT, Forrer P, Binz HK, Plückthun A: Designing Repeat Proteins: Modular Leucine-rich Repeat Protein Libraries Based on the Mammalian Ribonuclease Inhibitor Family. J Mol Biol 2003, 332: 471–487.
Kubarenko A, Frank M, Weber AN: Structure-function relationships of Toll-like receptor domains through homology modelling and molecular dynamics. Biochem Soc Trans 2007, 35: 1515–1518.
Rössle SC, Bisch PM, Lone YC, Abastado JP, Kourilsky P, Bellio M: Mutational analysis and molecular modeling of the binding of Staphylococcus aureus enterotoxin C2 to a murine T cell receptor Vbeta10 chain. Eur J Immunol 2002, 32: 2172–2178.
Hazai E, Bikádi Z: Homology modeling of breast cancer resistance protein (ABCG2). J Struct Biol 2008, 162: 63–74.
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28: 235–242.
Matsushima N, Kamiya M, Suzuki N, Tanaka T: Super-Motifs of Leucine-Rich Repeats (LRRs) Proteins. Genome Inform 2000, 11: 343–345.
Finn RD, Tate J, Mistry J, Coggill PC, Sammut JS, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A: The Pfam protein families database. Nucleic Acids Res 2008, 36: D281–288.
Mulder NJ, Apweiler R: InterPro and InterProScan: tools for protein sequence classification and comparison. Methods Mol Biol 2007, 396: 59–70.
Schultz J, Copley RR, Doerks T, Ponting CP, Bork P: SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res 2000, 28: 231–234.
Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Mazumder R, O'Donovan C, Redaschi N, Suzek B: The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 2006, 34: D187–191.
Heida N, Hasegawa Y, Mochizuki Y, Hirosawa K, Konagaya A, Toyoda T: TraitMap: an XML-based genetic-map database combining multigenic loci and biomolecular networks. Bioinformatics 2004, 20 Suppl 1: i152-i160.
Kunz H, Derz C, Tolxdorff T, Bernarding J: XML knowledge database of MRI-derived eye models. Comput Methods Programs Biomed 2004, 73: 203–208.
Jiang K, Nash C: Application of XML database technology to biological pathway datasets. Conf Proc IEEE Eng Med Biol Soc 2006, 1: 4217–4220.
eXist-db: an open source database management system [http://exist-db.org]
The World Wide Web Consortium [http://www.w3.org]
Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22: 2577–2637.
Jmol: an open-source Java viewer for chemical structures in 3D [http://www.jmol.org]
Fiser A, Do RK, Sali A: Modeling of loops in protein structures. Protein Sci 2000, 9: 1753–1773.
Laskowski RA, MacArthur MW, Moss DS, Thornton JM: PROCHECK: a program to check the stereochemical quality of protein structures. J Appl Cryst 1993, 26: 283–291.
Liu L, Botos I, Wang Y, Leonard JN, Shiloach J, Segal DM, Davies DR: Structural basis of Toll-like receptor 3 signaling with double-stranded RNA. Science 2008, 320: 379–381.
Maiti R, Van Domselaar GH, Zhang H, Wishart DS: SuperPose: a simple server for sophisticated structural superposition. Nucleic Acids Res 2004, 32: W590–594.
This work was supported by Graduiertenkolleg 1202 of the Deutsche Forschungsgemeinschaft.
TW and JG drafted the manuscript, extracted the data, compiled the database, wrote the code for the web interface and performed the statistical analysis. FJ, WMH, RWS and SCR conceived of the study, built the database server, participated in the database design and coordination and helped to draft the manuscript. TW and JG should be regarded as joint first authors. All authors read and approved the final manuscript.
Tiandi Wei, Jing Gong contributed equally to this work.