Relationships between classes were discovered as a by-product of cross training. The approach can be broadly divided into two parts. The first part deals with feature extraction and representation of a protein to train the classifiers for both PROSITE and SCOP. The second part involves hierarchical cross training and extraction of relationships between classes of PROSITE and SCOP.
Feature Sets
A variety of features are typically used in training a classifier. These choices are mostly empirical and intuitive, and making them is a non-trivial problem with a significant bearing on classification accuracy [22]. We used the novel features detailed below to train our classifiers.
Subsequences
Previous attempts have included fixed- and variable-length subsequences as feature sets [45]. Here, consecutive, overlapping subsequences of length k were chosen as features. A small k results in lower accuracy, while a large k leads to over-fitting. Therefore, a locally optimal value of k was chosen to maximize the accuracy of the classifier and enhance its statistical significance.
Subroutine to find optimal k:
    DP = dataset with the primary sequences
    mean-ss = 0
    k = 0
    while (increase in mean-accuracy) ≥ 0 and (ss ≥ mean-ss) do
        k = k + 1
        create dataset D from DP using subsequence features of length k
        for i = 1 to 10 do
            (TR[i], TE[i]) = split D into a training set and a test set
            train an SVM classifier CL on TR[i]
            accuracy[i] = accuracy of CL on TE[i]
        end for
        mean-accuracy = mean of accuracy[i] for i = 1 to 10
        calculate ss for this k using the t-test
        mean-ss = (mean-ss * (k - 1) + ss) / k
    end while
The value of statistical significance ss was defined as
(4)
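A minimal, runnable Python sketch of this selection loop is given below. It is an illustrative reimplementation using scikit-learn, not the original code; the t-test-based significance check is omitted for brevity, and the stopping rule is reduced to "mean accuracy no longer increases".

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def find_optimal_k(sequences, labels, max_k=8):
    # Pick the subsequence length k by 10-fold cross-validation.
    best_k, best_acc = None, -np.inf
    for k in range(1, max_k + 1):
        # Represent each protein by counts of its overlapping k-mers.
        vectorizer = CountVectorizer(analyzer="char", ngram_range=(k, k))
        X = vectorizer.fit_transform(sequences)
        acc = cross_val_score(LinearSVC(), X, labels, cv=10).mean()
        if acc <= best_acc:   # stop once the mean accuracy no longer increases
            break
        best_k, best_acc = k, acc
    return best_k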
Optimal k was found to be 4 on the PROSITE dataset. For a given protein p_i, the count of a k-length subsequence f was defined as
(5)
where L is the length of the complete protein sequence,
(6)
(7)
Count is the approximate number of occurrences of the feature f in a protein p_i. To give added weightage to the active sites in the protein, an occurrence, Occ, at an active site was counted multiple times (c times). SwissProt [35] entries were used to determine the active sites. The value of c was taken as 10 in our experiments.
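A hedged Python sketch of this weighted counting is shown below; the active-site residue positions are assumed to have already been parsed from the SwissProt entry, and the function name is illustrative.

def kmer_counts(sequence, active_sites, k=4, c=10):
    # Count overlapping k-mers; an occurrence that overlaps an annotated
    # active-site residue is counted c times instead of once.
    counts = {}
    sites = set(active_sites)
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        weight = c if any(pos in sites for pos in range(start, start + k)) else 1
        counts[kmer] = counts.get(kmer, 0) + weight
    return counts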
Blocks
Blocks were defined as features, and the count of a block was calculated as
Count = Block length / (1 + Block distance)
where the block distance is the dissimilarity index with respect to the most conserved corresponding block. This definition ensures that more weightage is given to larger blocks, which are assumed to preserve more biological information. Further, the weightage is inversely proportional to the block distance (the dissimilarity index with respect to the most conserved block) [42].
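As a minimal sketch (the block length and block distance are assumed to come from the corresponding Blocks database entry):

def block_count(block_length, block_distance):
    # Larger blocks score higher; blocks more dissimilar to the most
    # conserved block (larger distance) are down-weighted.
    return block_length / (1.0 + block_distance)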
2-D elastic profile
Previous attempts to use secondary structure as a feature for protein classification have mostly been limited to the use of secondary structure content [46, 47] or localized secondary structure [48]. To our knowledge, no previous attempt has been made to use the global secondary structure profile of a protein as a feature. One of the reasons is that proteins have variable lengths, which makes comparison difficult. This problem was solved by introducing the notion of an elastic secondary structure. The secondary structure profile was extracted from SwissProt and linearly scaled, by stretching or compressing, to a length of 100, resulting in an 'elastic' profile. The number 100 was chosen purely for convenience. Intuitively, the elastic profile also behaves like a global feature, as it is influenced not only by changes in a given locality but also by additions or deletions at other locations in the protein.
Formally, for a protein p of size L, a secondary structure array was defined as
(9)
Then, using this array, the 2-D elastic feature was defined as
(10)
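A minimal sketch of the rescaling step, assuming the per-residue secondary-structure labels (e.g. 'H', 'E', 'C') have already been read from the SwissProt annotation:

def elastic_profile(ss_labels, target_len=100):
    # Linearly stretch or compress a per-residue secondary-structure array
    # to a fixed length by mapping each target position back onto the sequence.
    L = len(ss_labels)
    return [ss_labels[min(L - 1, (j * L) // target_len)] for j in range(target_len)]

# elastic_profile(list("HHHHEEECCC"), target_len=5) -> ['H', 'H', 'E', 'E', 'C']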
Other Features
Other features used for classification included molecular mass, size, and the percentages of helices and beta strands in the whole protein. One column/dimension was maintained for each feature. The value of each feature was either its absolute value (e.g. the mass, in the case of molecular mass) or binary (1 if the feature was present, 0 otherwise). Equal-interval binning was used for many features (e.g. percentage of helices, beta strands) to allow generalization.
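A minimal sketch of the equal-interval binning, assuming ten equal-width bins over the 0-100% range (the bin count is an illustrative choice):

def bin_percentage(value, n_bins=10):
    # Map a percentage (0-100) to a one-hot vector over equal-width bins.
    idx = min(n_bins - 1, int(value / (100.0 / n_bins)))
    return [1 if i == idx else 0 for i in range(n_bins)]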
Final representation
A protein was represented as a vector of all the above features. This representation is based on the assumption that the features are orthogonal to each other. The assumption was made for the sake of time efficiency and to reduce the complexity of the algorithm.
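Putting the pieces together, a toy sketch of the final vector; the numeric values and the H=1/E=2/C=3 encoding of the elastic profile are illustrative assumptions:

# Toy assembly of the final protein vector from the individual feature groups.
kmer_feats    = [2, 0, 1, 11]        # counts of a few length-4 subsequences
block_feats   = [1.5, 0.8]           # Count values for two blocks
elastic_feats = [1, 1, 2, 3] * 25    # elastic profile encoded as H=1, E=2, C=3
other_feats   = [52500.0, 480]       # molecular mass (Da), sequence length
protein_vector = kmer_feats + block_feats + elastic_feats + other_feats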
Hierarchical Cross-training
Hierarchical cross-training of SVMs involves the introduction of new artificial dimensions/features to distinguish between instances that are otherwise indistinguishable using the normal feature sets. If the A-classes are good predictors of the B-classes, the classification accuracy for proteins in B may be improved by allocating, for each protein in B, a set of new columns/features, one for each A-class (Figure 2). Hence, the altered protein is represented as:
p_i^B = < f_1, f_2, ..., f_n, Cm_i >     (11)
Here, Cm_i refers to the enhanced feature set for each protein in B obtained from the classes in A. While adding new dimensions to the protein feature vector, an assumption is made that the kernel space remains orthogonal; specifically, that the new set of dimensions Cm_i is also orthogonal to all other features. Since the protein classes are taken from a hierarchy, this assumption is not entirely true. This concern was addressed by modifying the algorithm and adapting it to hierarchical biological taxonomies.
Firstly, one-vs-rest SVMs are trained for each class. For training a non-leaf class, the positive data is taken from its descendant leaf nodes, while the rest of the data is taken as negative examples. While dealing with a hierarchy during cross training, the basic idea used was that a protein that belongs to a child class also belongs to the corresponding parent class. To be more specific, let p be a protein, c any class, and Ansc_c the set of all ancestor classes of c. Two cases arise:
1. Rule 1: p has a high probability of belonging to class c. Then p also has a high probability of belonging to the ancestor classes Ansc_c.
2. Rule 2: p has a low probability of belonging to class c. In this case, nothing can be said about p's relation to the ancestor classes Ansc_c.
Cross Train Algorithm:
1. Train SVMs for the A-classes (C_A) using proteins from dataset-A (D_A).
2. Train SVMs for the B-classes (C_B) using proteins from dataset-B (D_B).
3. Each protein p_i in D_B is classified using C_A and the corresponding class-membership vector Cm_i = <cm_i1, cm_i2, ..., cm_i|A|> is calculated (the class-membership represents the probability of an instance belonging to the various classes in a taxonomy). Here cm_ij is the SVM score obtained by 'testing' protein p_i with the SVM for the j-th class.
4. Update-Protein: using the class-membership Cm_i, the features of every protein in B are updated using the protein update rule.
5. Similarly, repeat the above steps for the proteins in A.
6. Retrain C_A using the modified proteins from D_A.
7. Retrain C_B using the modified proteins from D_B.
8. Return to step 3 if there is an increase in the classification accuracy of C_A and C_B.
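A minimal sketch of one cross-training round using scikit-learn's one-vs-rest SVMs is given below. It is an illustrative reimplementation, not the original code: the feature matrices are assumed to be dense NumPy arrays, the labels are assumed to be available, and both the outer accuracy-driven loop (step 8) and the hierarchical protein update rule of the following subroutine are omitted.

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

def cross_train_round(XA, yA, XB, yB):
    # Train one-vs-rest SVMs for the A-classes and the B-classes.
    clf_A = OneVsRestClassifier(LinearSVC()).fit(XA, yA)
    clf_B = OneVsRestClassifier(LinearSVC()).fit(XB, yB)
    # Class-membership vectors: SVM scores of each dataset against the
    # classifiers of the other taxonomy.
    cm_B = clf_A.decision_function(XB)   # memberships of B-proteins in A-classes
    cm_A = clf_B.decision_function(XA)   # memberships of A-proteins in B-classes
    # Append the membership columns as new feature dimensions and retrain.
    XA_new, XB_new = np.hstack([XA, cm_A]), np.hstack([XB, cm_B])
    clf_A = OneVsRestClassifier(LinearSVC()).fit(XA_new, yA)
    clf_B = OneVsRestClassifier(LinearSVC()).fit(XB_new, yB)
    return clf_A, clf_B, XA_new, XB_new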
The above information is incorporated in the protein update rule. Further, it needs to be established when a given protein belongs to a particular class c with "high probability". One simple way of estimating this is to calculate the class-membership vector Cm_p for a given protein p by testing it with the SVM classifier of each class. The class with the maximum positive value in Cm_p is then defined as the only class to which the protein p belongs with "high probability". This method is, however, naive and would miss the correct class when more than two classes have high and close positive values. Also, during experimentation it was found that in many instances the entire Cm_p vector is negative, and hence no single positive value exists. A softer version was therefore developed to replace the cross-training update rule, in which Cm_p is first re-scaled and the above two rules are then used to update the membership values of the ancestor classes.
Subroutine to update Protein Vector:
I/P: protein p; O/P: updated protein
Calculate the class-membership vector Cm_p.
Rescaling step: Find the maximum class-membership value val_max and add (1 - val_max) to each element of the vector. This step ensures a positive value for at least one class.
Identifying high-probability classes: Find all classes C_p for which the class-membership value is positive.
for every class c ∈ C_p {
    Let val_c be the class-membership value of c.
    Find the ancestor classes Ansc_c.
    Updating ancestor classes: Increase the class-membership of each class in Ansc_c by val_c.
}
end Subroutine
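A minimal Python sketch of this subroutine; the class-membership vector is represented as a dict from class name to SVM score, and the ancestor lookup table is assumed to have been built from the taxonomy.

def update_membership(cm, ancestors):
    # Rescaling step: guarantee a positive value for at least one class.
    shift = 1.0 - max(cm.values())
    cm = {c: v + shift for c, v in cm.items()}
    # Propagate the score of every positive (high-probability) class to each
    # of its ancestor classes (Rule 1); negative classes are left alone (Rule 2).
    updated = dict(cm)
    for c, val in cm.items():
        if val > 0:
            for a in ancestors.get(c, []):
                updated[a] = updated.get(a, 0.0) + val
    return updated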
Extracting relationships using the decision tree
The decision tree [49] algorithm induces a series of comparisons in the form of a binary tree, where each non-leaf node compares the value of a feature f_i (a class from taxonomy A) with a constant. The comparison decides whether to descend into the left or the right subtree. The leaf nodes are the classes to which the instance can belong (classes from taxonomy B). Hence, if the membership of a protein in one taxonomy is known, it can be used to find its class in the other taxonomy. The advantage of this approach is that the protein is not required to belong to only a single class, and the user can input the strength of membership for each class. A probabilistic weighted score is generated based on the decision tree. We employed the decision tree algorithm to estimate the probability that a protein belonging to a given class in SCOP belongs to a given class in PROSITE, and vice versa. This created a probability map from SCOP to PROSITE, and vice versa, linking all the classes in the two taxonomies to each other with probabilistic weights. Since PROSITE is a functional classification scheme and SCOP is a structural classification scheme, by corollary, this probabilistic map can be construed as a probabilistic map between functional and structural properties of proteins.
Subroutine to create decision tree:
A and B are taxonomies.
D_A = dataset for A after full cross-training with B.
Calculate the class-membership vector Cm_i for every p_i ∈ D_A using the classes in B.
Represent every protein p_i in A using Cm_i; call this new dataset D'_A.
Train a decision tree DT_A using the dataset D'_A.
Repeat the above steps for B to obtain the decision tree DT_B.
Each path in DT_A is a rule classes-in-B → class-in-A.
Each path in DT_B is a rule classes-in-A → class-in-B.
end Subroutine
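A minimal sketch of this step using scikit-learn's decision tree; the class-membership matrix and the labels are assumed to be available from the cross-trained SVMs, and the variable names are illustrative.

from sklearn.tree import DecisionTreeClassifier

def relationship_tree(cm_features, class_labels):
    # Learn rules "classes in one taxonomy -> class in the other" from the
    # class-membership representation produced after cross-training.
    tree = DecisionTreeClassifier()
    tree.fit(cm_features, class_labels)
    return tree

# Example: DT_A maps membership scores over SCOP classes to a PROSITE class.
# dt_A = relationship_tree(scop_membership_matrix, prosite_labels)
# probability_map = dt_A.predict_proba(scop_membership_matrix)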