Data set of RNA-binding proteins
The primary source of RNA-binding proteins and their annotations into various categories was the SCOR database [16]. First, a list of all PDB codes present in SCOR was compiled, resulting in 569 entries. All 569 PDB entries were scanned for RNA (998 chains) and proteins (1435 chains). Protein chains were then checked for direct contact with at least one RNA chain, and those with at least 3 residues in contact were selected, resulting in 1242 chains. FASTA-formatted protein sequences were generated from the PDB files and redundancy was removed by clustering them at 25% sequence identity using BLASTCLUST [17]. This resulted in the RBP_NR25 database of 160 protein chains, subsequently referred to simply as RBP. The SCOR functional classification was used to annotate them as binding to mRNA (13 chains), tRNA (20 chains), rRNA (84 chains) or viral RNA (17 chains). The final list of selected protein chains and their calculated moments, along with other data sets, is provided in Additional File 1.
Development of control data sets
First, a non-redundant list of all protein chains in PDB was obtained from PDBselect [18]. The latest PDBselect (May 30, 2010 version; 25% sequence identity clusters) consisted of 4868 protein chains. Chains shorter than 50 residues were removed, leaving 4133 protein chains. Next, a keyword search for "Nucleic acid binding" was carried out in SWISSPROT, which returned 20595 protein chains. The 4133 chains selected from PDBselect were then aligned against all 20595 SWISSPROT sequences using BLAST at an e-value cutoff of 0.01, and chains showing similarity at this cutoff were excluded from the PDBselect sequence set. Further, the PDB entry type was checked and nucleic acid binding chains were removed, leaving 2441 sequences with no similarity to RNA-binding proteins of known or unknown structure. These 2441 protein chains were used as the control data set for all our analyses (see Additional File 1).
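The filtering steps described above can be sketched as a simple set-based procedure. This is only an illustration: the function name `build_control_set`, the chain identifiers and the toy sequences are hypothetical, and the BLAST search and the PDB entry-type check are assumed to have been run separately, with their results supplied as sets of chain identifiers.

```python
def build_control_set(chains, blast_hit_ids, na_binding_ids, min_length=50):
    """Return the chains that survive all three filters.

    chains         : dict mapping chain id -> amino acid sequence
    blast_hit_ids  : ids with a BLAST hit (e-value <= 0.01) against the
                     nucleic-acid-binding sequences (computed elsewhere)
    na_binding_ids : ids flagged as nucleic acid binding by PDB entry type
    """
    kept = {}
    for chain_id, sequence in chains.items():
        if len(sequence) < min_length:      # drop chains shorter than 50 residues
            continue
        if chain_id in blast_hit_ids:       # drop chains similar to binding proteins
            continue
        if chain_id in na_binding_ids:      # drop chains from nucleic acid complexes
            continue
        kept[chain_id] = sequence
    return kept


# Toy input (hypothetical identifiers and sequences).
toy_chains = {
    "1abcA": "M" * 120,
    "2defB": "M" * 30,    # too short
    "3ghiC": "M" * 80,    # has a BLAST hit
    "4jklD": "M" * 200,   # nucleic acid binding entry
}
control = build_control_set(toy_chains, {"3ghiC"}, {"4jklD"})
print(sorted(control))
```

Keeping the similarity search outside the function means the same filter can be reused with a different e-value cutoff or a different search tool.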
Complex versus monomeric structure pairs
Sequence homologues of the proteins in the above data set (RBP_NR25) were searched for in PDB at a minimum of 90% sequence identity, and the best match was selected. The minimum alignment coverage was also set at 90%, and only those target sequences that occurred in monomeric PDB entries were selected.
Calculations of electric moments
Charge, dipole moment and quadrupole moments were calculated as described in our earlier study [13]. According to that study, using all-atom coordinates did not affect the overall results, as compared to the low-resolution model with only backbone coordinates. Thus, in this study, side-chain coordinates of the proteins were ignored and the electric moments were based on the main-chain conformation determined by the Cα positions of the residues. All Lys and Arg residues were assigned a positive charge, and Glu and Asp residues were considered negative. All other residues were treated as neutral; this includes His, because considering its charged states had negligible effects (see Results section). All water molecules, metals and ligands were also ignored in these calculations.
Components of dipole moments were calculated using the expression
(1)
where Ro is the reference point, which was taken as the geometric center of all the residues (Cα-positions) in the structure, and i represents an atom in the protein structure. Net dipole moment was calculated by taking a vector sum of these components.
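As an illustration of the description above, the following sketch assigns unit charges to Cα positions and sums q_i (r_i - Ro) over all residues, with Ro taken as the geometric center of all Cα atoms. The function name, the input layout and the toy coordinates are assumptions made for this example; the conversion factor 1 e·Å ≈ 4.8032 D is the standard one.

```python
import math

# Charge assignment from the text: Lys/Arg positive, Asp/Glu negative, rest neutral.
RESIDUE_CHARGE = {"LYS": 1.0, "ARG": 1.0, "ASP": -1.0, "GLU": -1.0}
E_ANGSTROM_TO_DEBYE = 4.8032  # 1 e*Angstrom expressed in Debye


def dipole_moment(residues):
    """residues: list of (residue_name, (x, y, z)) with C-alpha coordinates in Angstrom.

    Returns (components, magnitude) in units of e*Angstrom.
    """
    n = len(residues)
    # Reference point: geometric center of ALL C-alpha atoms, charged or not.
    center = [sum(pos[k] for _, pos in residues) / n for k in range(3)]
    components = [0.0, 0.0, 0.0]
    for name, pos in residues:
        q = RESIDUE_CHARGE.get(name, 0.0)
        for k in range(3):
            components[k] += q * (pos[k] - center[k])
    magnitude = math.sqrt(sum(c * c for c in components))
    return components, magnitude


# Toy three-residue "protein" (hypothetical coordinates).
toy = [("LYS", (0.0, 0.0, 0.0)), ("GLU", (4.0, 0.0, 0.0)), ("ALA", (2.0, 3.0, 0.0))]
mu, mu_abs = dipole_moment(toy)
mu_debye = mu_abs * E_ANGSTROM_TO_DEBYE
print(mu, mu_abs, mu_debye)
```

For a chain with a non-zero net charge the dipole moment depends on the choice of reference point, which is why the center is computed from all residues rather than from the charged ones only.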
The quadrupole moment is a tensor of rank 2, and a direct calculation from the PDB coordinates gives nine components (Mxx, Mxy, Mxz, Myx, Myy, Myz, Mzx, Mzy and Mzz). Each of these components is calculated by the following expression
(2)
where ri is the relative position vector, i is the index of the charge and the summation is over all charges. The quadrupole moment matrix can be diagonalized, and its three eigenvalues are denoted Q1, Q2 and Q3 in decreasing order. We used the largest eigenvalue Q1 as the single representative quadrupole moment and all three eigenvalues for developing the predictor.
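A minimal sketch of this calculation is given below. It assumes the simple second-moment form M_ab = Σ q_i r_ia r_ib, with r_i measured from the geometric center of all positions. This is an assumption of the example; a traceless convention would give different values. The eigenvalues of the symmetric 3×3 matrix are obtained with the standard closed-form trigonometric solution so that no external library is needed. All function names and the toy input are hypothetical.

```python
import math


def quadrupole_tensor(charges, positions):
    """Second-moment tensor M[a][b] = sum_i q_i * r_ia * r_ib (assumed form)."""
    n = len(positions)
    center = [sum(p[k] for p in positions) / n for k in range(3)]
    M = [[0.0] * 3 for _ in range(3)]
    for q, p in zip(charges, positions):
        r = [p[k] - center[k] for k in range(3)]
        for a in range(3):
            for b in range(3):
                M[a][b] += q * r[a] * r[b]
    return M


def symmetric_eigenvalues(M):
    """Eigenvalues of a real symmetric 3x3 matrix, in decreasing order."""
    p1 = M[0][1] ** 2 + M[0][2] ** 2 + M[1][2] ** 2
    if p1 == 0.0:  # already diagonal
        return sorted((M[0][0], M[1][1], M[2][2]), reverse=True)
    q = (M[0][0] + M[1][1] + M[2][2]) / 3.0
    p2 = sum((M[k][k] - q) ** 2 for k in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    B = [[(M[a][b] - (q if a == b else 0.0)) / p for b in range(3)] for a in range(3)]
    det_b = (B[0][0] * (B[1][1] * B[2][2] - B[1][2] * B[2][1])
             - B[0][1] * (B[1][0] * B[2][2] - B[1][2] * B[2][0])
             + B[0][2] * (B[1][0] * B[2][1] - B[1][1] * B[2][0]))
    r = max(-1.0, min(1.0, det_b / 2.0))  # guard against rounding outside [-1, 1]
    phi = math.acos(r) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    e2 = 3.0 * q - e1 - e3  # the trace equals the sum of the eigenvalues
    return [e1, e2, e3]


# Toy input: three point charges at hypothetical C-alpha positions.
toy_charges = [1.0, -1.0, 1.0]
toy_positions = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (2.0, 3.0, 0.0)]
M_toy = quadrupole_tensor(toy_charges, toy_positions)
Q1, Q2, Q3 = symmetric_eigenvalues(M_toy)
print(Q1, Q2, Q3)
```

Because the tensor is symmetric, only six of the nine components are independent, and the eigenvalues do not depend on the orientation of the coordinate frame.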
All electric moment values were taken as absolute values and normalized by the protein sequence length, in a way similar to our earlier study [13]. Units are often omitted in describing quadrupole moments and net charge, as these values are measured in atomic units (using the electronic charge and Å as the charge and distance units in calculations). Dipole moment values are quoted after converting them to Debyes.
Our method of computing electric moments differs somewhat from a similar approach adopted in a recently published dipole moment server [15]. First, we use only the Cα atoms for assigning charges, whereas charges are assigned to specific atomic positions in [15]. Second, we used the geometric center of all Cα atoms (including residues with zero charge assignments) to compute the reference point and axes. Finally, we obtain quadrupole moments by taking their eigenvalues, which are not provided in [15]. We find that there is a moderate correlation (~0.5) between the dipole moments computed by the two methods. Since our approach is more suitable for low-resolution structures (it does not require side-chain positions), we report only the results obtained by our procedure. For similar reasons, we did not try to predict the protonation state of residues, which could sometimes be possible if side-chain coordinates are provided [19].
Statistical significance of difference
Distributions of moments between control and RNA-binding proteins, as well as between various classes of RNA-binding proteins, were compared by measuring the statistical significance of the difference between their means. A two-tailed Student t-test was conducted for all such comparisons using the open-source statistical programming language R (http://r-project.org). Histograms of the distributions were also plotted in the same package.
Difference between bound and unbound pairs
For each protein chain in the RBP data set, a data set of monomeric proteins from PDB was scanned. Proteins with more than 90% similarity and coverage values were used as a pair of complexed and unbound monomers. Electric moments were then computed for both of them by the procedure described above. A total of 27 proteins were found to occur in both monomeric and RNA-complexed forms.
The difference between the electric moments of a protein in its complexed and unbound forms was measured using the Euclidean distance (ED), defined as follows:
(3)
where X refers to the dipole or quadrupole moment of the protein and the summation is taken over all protein pairs considered in a category (effectively a distance in 27-dimensional space).
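Read as a distance in a space with one dimension per protein pair, the ED described above can be sketched as the square root of the summed squared differences between the complexed and unbound values. The function name and the toy numbers below are hypothetical, and the sketch assumes the two lists are ordered so that matching positions belong to the same protein.

```python
import math


def euclidean_distance(complexed, unbound):
    """ED between paired moment values (one entry per protein pair)."""
    if len(complexed) != len(unbound):
        raise ValueError("each complexed value needs a matching unbound value")
    return math.sqrt(sum((c - u) ** 2 for c, u in zip(complexed, unbound)))


# Toy values for three protein pairs (hypothetical numbers).
x_complexed = [3.0, 1.0, 2.0]
x_unbound = [0.0, 5.0, 2.0]
ed = euclidean_distance(x_complexed, x_unbound)
print(ed)
```

With 27 protein pairs, each list would hold 27 values, one per pair.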
Neural network for prediction
A neural network-based predictor, similar to our earlier implementations (e.g. in [20]), was used to find a relationship between input vectors, composed here of five descriptors (charge, dipole moment and the three eigenvalues of the quadrupole moment), and the functional property of the protein chain, e.g. binding or non-binding (control). To account for any cooperative and non-linear contribution of moments, a single hidden layer with 3 nodes was used. To avoid over-estimation of performance, the neural network was trained in a jackknife style, by optimizing the predictor on all but one protein in the data set. Once training was completed, the prediction on the left-out protein was evaluated. After running through all binding and control proteins, the overall prediction performance on the left-out proteins was evaluated. Since the neural network returns a real value between 0 and 1 for the target outputs 0 (non-binding) or 1 (binding), ROC data relating specificity and sensitivity were calculated and converted to the area under the curve (AUC) value, which reflects performance over the entire range of cutoffs. Other measures of performance are as follows (T refers to true and F refers to false, whereas P is the positive class and N is the negative class):
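The jackknife (leave-one-out) evaluation and the AUC calculation can be illustrated with the sketch below. To keep the example short and dependency-free, the neural network is replaced by a nearest-centroid scorer. This stand-in is an assumption of the example and is not the predictor used in this study. All function names and the toy descriptors are hypothetical.

```python
import math


def leave_one_out_scores(features, labels, fit, predict):
    """Train on all samples but one, then score the left-out sample; repeat for all."""
    scores = []
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        scores.append(predict(model, features[i]))
    return scores


def auc(scores, labels):
    """Area under the ROC curve as the fraction of correctly ranked (pos, neg) pairs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Stand-in model (NOT a neural network): score by relative distance to class centroids.
def fit_centroids(xs, ys):
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    return (centroid([x for x, y in zip(xs, ys) if y == 1]),
            centroid([x for x, y in zip(xs, ys) if y == 0]))


def predict_centroids(model, x):
    pos_c, neg_c = model
    d_pos = sum((a - b) ** 2 for a, b in zip(x, pos_c))
    d_neg = sum((a - b) ** 2 for a, b in zip(x, neg_c))
    return 1.0 / (1.0 + math.exp(d_pos - d_neg))  # value in (0, 1)


# Toy two-descriptor data (hypothetical); 1 = binding, 0 = control.
X = [[2.0, 1.0], [2.2, 0.9], [1.8, 1.2], [0.1, 0.2], [0.0, 0.1], [0.3, 0.0]]
y = [1, 1, 1, 0, 0, 0]
loo_scores = leave_one_out_scores(X, y, fit_centroids, predict_centroids)
print(auc(loo_scores, y))
```

Passing `fit` and `predict` as arguments keeps the evaluation loop independent of the model, so any predictor that returns a real-valued score can be substituted.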
(4)
The F-measure is the harmonic mean of precision and recall and can be computed by transforming the real-valued outputs of the neural network into binary class-label predictions at various cutoffs. The cutoff at which the F-measure has the highest value is used for reporting all class-wise performance measures, i.e. precision, recall, accuracy and F-measure.
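The cutoff selection can be sketched as follows, using the standard definitions precision = TP/(TP+FP), recall = TP/(TP+FN), accuracy = (TP+TN)/(TP+FP+TN+FN) and F-measure = 2·precision·recall/(precision+recall). These standard forms are assumed here. The function names, the candidate cutoffs and the toy scores are hypothetical.

```python
def confusion_counts(scores, labels, cutoff):
    """Count TP, FP, TN, FN when scores >= cutoff are called positive."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        predicted = 1 if s >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def class_measures(tp, fp, tn, fn):
    """Return (precision, recall, accuracy, F-measure); empty ratios are set to 0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    total = precision + recall
    f_measure = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, accuracy, f_measure


def best_cutoff(scores, labels, cutoffs):
    """Cutoff with the highest F-measure (first one wins on ties)."""
    return max(cutoffs,
               key=lambda c: class_measures(*confusion_counts(scores, labels, c))[3])


# Toy predictor outputs (hypothetical); 1 = binding, 0 = control.
toy_scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]
chosen = best_cutoff(toy_scores, toy_labels, [0.2, 0.35, 0.5, 0.7])
print(chosen, class_measures(*confusion_counts(toy_scores, toy_labels, chosen)))
```

Reporting all class-wise measures at the single F-optimal cutoff keeps precision, recall and accuracy mutually consistent, whereas the AUC summarizes performance across all cutoffs.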