Relationships between classes were discovered as a by-product of cross-training. The approach can be broadly divided into two parts. The first part deals with feature extraction and the representation of a protein to train the classifiers for both PROSITE and SCOP. The second part involves hierarchical cross-training and the extraction of relationships between classes of PROSITE and SCOP.
Feature Sets
A variety of features are typically used in training a classifier. These choices are mostly empirical and intuitive, and making them is a non-trivial problem with a significant bearing on the accuracy of classification [22]. We have used the novel features detailed below to train our classifiers.
Subsequences
Previous attempts have included fixed- and variable-length subsequences as feature sets [45]. Consecutive and overlapping subsequences of length k are chosen as features. However, a small k would result in lower accuracy, while a large k would lead to overfitting. Therefore, a locally optimal value of k was chosen to maximize the accuracy of the classifier and enhance its statistical significance.
Subroutine to find optimal k:
   D_P = dataset with the primary sequences
   mean_ss = 0
   k = 0
   while (increase in mean_accuracy ≥ 0) and (ss ≥ mean_ss) do
      k = k + 1
      Create D from D_P with sequence features of length k
      for i = 1 to 10 do
         (TR[i], TE[i]) = split dataset D into train and test sets
         Train an SVM classifier CL using training data TR[i]
         accuracy[i] = test classifier CL on test data TE[i]
      end for
      mean_accuracy = mean of accuracy[i] for i = 1 to 10
      Calculate ss for this set using the t-test
      mean_ss = (mean_ss * (k − 1) + ss) / k
   end while
The value of statistical significance ss was defined as
\begin{array}{cc}ss=\frac{\sqrt{N}}{\text{Standard error}},& \text{number of iterations } N=10\end{array}
(4)
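The k-selection loop above can be sketched in Python. The `evaluate` callback, standing in for the ten-fold SVM train/test step, is a placeholder of our own and not part of the original method:

```python
import statistics

def significance(accuracies):
    # ss = sqrt(N) / standard error, per Eq. (4); here N = number of folds
    n = len(accuracies)
    std_err = statistics.stdev(accuracies) / n ** 0.5
    return n ** 0.5 / std_err

def find_optimal_k(evaluate, max_k=10):
    """evaluate(k) -> list of 10 fold accuracies for subsequence length k."""
    k, mean_ss, prev_acc, best_k = 0, 0.0, 0.0, 0
    while k < max_k:
        k += 1
        accs = evaluate(k)
        mean_acc = statistics.mean(accs)
        ss = significance(accs)
        # stop when accuracy drops or significance falls below its running mean
        if mean_acc - prev_acc < 0 or ss < mean_ss:
            break
        best_k, prev_acc = k, mean_acc
        mean_ss = (mean_ss * (k - 1) + ss) / k  # running mean of ss
    return best_k
```

With a toy `evaluate` whose accuracy peaks at k = 4 and then falls, the loop returns 4, mirroring the behaviour reported for the PROSITE dataset.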
Optimal k was found to be 4 on the PROSITE dataset. For a given protein {p}_{i}, the count of a k-length subsequence f was defined as
Count(f,{p}_{i})={\displaystyle \sum _{j=1}^{\lfloor L/k\rfloor }Occ(f,j,{p}_{i})\ast Active(j,{p}_{i})}
(5)
where L is the length of the complete protein sequence,
Active(j,{p}_{i})=\{\begin{array}{ll}\text{constant } c\;(=10),\hfill & \text{if } j \text{ is an active site;}\hfill \\ 1,\hfill & \text{otherwise}\hfill \end{array}
(6)
Occ(f,j,{p}_{i})=\{\begin{array}{ll}1,\hfill & \text{if } f \text{ overlaps position } k\ast j \text{ in } {p}_{i};\hfill \\ 0,\hfill & \text{otherwise}\hfill \end{array}
(7)
Count is the approximate number of occurrences of the feature f in a protein {p}_{i}. To give added weight to the active sites in the protein, an occurrence Occ at an active site was counted multiple times (c times). Swiss-Prot [35] entries were used to determine the active sites. The value of c was taken as 10 in our experiments.
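Equations (5)–(7) can be read as the following sketch, assuming non-overlapping windows starting at positions k·j (a simplified reading of "overlaps") and a set of active-site positions taken from Swiss-Prot annotations; the function name and argument layout are our own:

```python
def count_feature(f, seq, active_sites, k=4, c=10):
    """Approximate occurrence count of a k-length subsequence f in seq (Eq. 5).

    Each window is checked for f (Occ, Eq. 7); windows covering an annotated
    active-site position are weighted by the constant c (Active, Eq. 6).
    """
    total = 0
    L = len(seq)
    for j in range(L // k):            # j indexes non-overlapping windows
        start = k * j
        if seq[start:start + k] == f:  # Occ(f, j, p_i) = 1
            # Active(j, p_i): weight c if the window touches an active site
            weight = c if any(start <= s < start + k for s in active_sites) else 1
            total += weight
    return total
```

For example, a protein with three windows matching f, one of which covers an active site, scores 1 + 10 + 1 = 12 rather than 3.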
Blocks
Blocks were defined as features and the count was calculated as

Count = Block length / (1 + Block distance)

where the block distance is the dissimilarity index with respect to the most conserved corresponding block.
This definition ensures that more weight is given to larger blocks, which are assumed to preserve more biological information. Further, the weight is inversely proportional to the block distance (dissimilarity index) from the most conserved block [42].
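As a minimal illustration of the block weighting, with hypothetical numeric inputs:

```python
def block_count(block_length, block_distance):
    # Weight grows with block length and shrinks with the dissimilarity
    # index (block distance) from the most conserved block.
    return block_length / (1 + block_distance)
```

A long, perfectly conserved block (distance 0) keeps its full length as its count, while the same length at distance 2 is down-weighted by a factor of 3.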
2D elastic profile
Previous attempts to use secondary structure as a feature for protein classification have mostly been limited to utilizing secondary-structure content [46, 47] or localized secondary structure [48]. To our knowledge, no previous attempt has been made to use the global secondary-structure profile of the protein as a feature. One reason is that proteins have variable lengths, which makes comparison difficult. This problem was solved by introducing the notion of an elastic secondary structure. The secondary-structure profile was extracted from Swiss-Prot and linearly scaled to a length of 100, resulting in an 'elastic' profile through stretching or compressing. Here the number 100 was chosen purely for convenience. Intuitively, the profile also behaves like a global feature, as it is influenced not only by changes in a locality but also by additions or deletions at other locations in the protein.
Formally, for a protein p of size L, a secondary structure array was defined as
\begin{array}{c}{2}^{\circ}\text{struct}[i]=\{\begin{array}{ll}1,\hfill & \text{if strand at location } i\hfill \\ 2,\hfill & \text{if turn at location } i\hfill \\ 3,\hfill & \text{if helix at location } i\hfill \\ NAN,\hfill & \text{otherwise}\hfill \end{array}\\ \text{for } i \text{ from } 1 \text{ to } L.\end{array}
(9)
Then using this array the 2D elastic feature was defined as
\begin{array}{c}\text{2D elastic}[i]={2}^{\circ}\text{struct}\left[\lfloor L/100\ast i\rfloor \right]\\ \text{for } i \text{ from } 1 \text{ to } 100\end{array}
(10)
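The elastic rescaling of Eq. (10) can be sketched as follows, using 0-based Python indexing clamped at the sequence end (the numeric codes follow Eq. (9)):

```python
def elastic_profile(sec_struct, target_len=100):
    """Linearly rescale a per-residue secondary-structure array (Eq. 9 codes:
    1 = strand, 2 = turn, 3 = helix) to a fixed-length 'elastic' profile (Eq. 10).
    """
    L = len(sec_struct)
    # floor(L/target_len * i) for i = 1..target_len, clamped into 0-based range
    return [sec_struct[min(L - 1, (L * i) // target_len)]
            for i in range(1, target_len + 1)]
```

A 20-residue protein is thus stretched to 100 positions, while a 500-residue protein is compressed to the same length, making profiles directly comparable.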
Other Features
Molecular mass, size, percentage of helices and beta strands in the whole protein, etc., were other features used for classification. One column/dimension was maintained for each feature. The value of each feature was either the absolute value (e.g. the mass, in the case of molecular mass) or binary (1 if the feature was present; 0 otherwise). Equal-interval binning was used for many features (e.g. percentage of helices, beta strands, etc.) to allow generalization.
Final representation
A protein was represented as a vector of all the above features. This representation is based on the assumption that the features are orthogonal to each other. This assumption was made for the sake of time efficiency and to reduce the complexity of the algorithm.
Hierarchical Cross-training
Hierarchical cross-training of SVMs involves the introduction of new artificial dimensions/features to distinguish between instances that are otherwise indistinguishable using the normal feature sets. So if A-classes are good predictors of B-classes, the classification accuracy of proteins in B may be improved by allocating for each protein in B a set of new columns/features, one for each A-class (Figure 2). Hence, the altered protein {{p}^{\prime}}_{i} is represented as:
{{p}^{\prime}}_{i}=\left(\begin{array}{c}{p}_{i}\\ C{m}_{i}^{T}\end{array}\right)
(11)
Here, C{m}_{i}^{T} refers to the enhanced feature set for each protein in B obtained from the classes in A. While adding new dimensions to the protein feature vector, an assumption is made that the kernel space remains orthogonal. Specifically, the new set of dimensions C{m}_{i} is also assumed to be orthogonal to all other features. Since protein classes are taken from a hierarchy, this assumption is not entirely true. This concern was addressed by modifying the algorithm and adapting it for hierarchical biological taxonomies.
Firstly, one-vs-rest SVMs are trained for each class. For training a non-leaf class, the positive data used is that present within the descendant leaf nodes, while the rest of the data is taken as negative examples. While dealing with a hierarchy during cross-training, the basic idea used was that a protein that belongs to a child class also belongs to the corresponding parent class. To be more specific, let p be a protein, c any class, and Ansc_{c} the set of all classes ancestral to class c. Two cases arise:

1. Rule 1: p has a high probability of belonging to class c. Then p also has a high probability of belonging to the ancestor classes Ansc_{c}.

2. Rule 2: p has a low probability of belonging to class c. In this case, nothing can be said about p's relation to the ancestor classes Ansc_{c}.
Cross-Train Algorithm:
1. Train SVMs for the A-classes (C_A) using proteins from dataset A (D_A).
2. Train SVMs for the B-classes (C_B) using proteins from dataset B (D_B).
3. Classify each protein p_i in D_B using C_A and calculate the corresponding class-membership vector Cm_i = (cm_i^1, ..., cm_i^n); class-membership represents the probability of an instance belonging to the various classes in a taxonomy. Here cm_i^j is the SVM score obtained by 'testing' protein p_i with the SVM for the j-th class.
4. UpdateProtein: using the class-membership Cm_i of every protein in B, update the protein features using the protein update rule.
5. Similarly, repeat the above steps for the proteins in A.
6. Retrain C_A using the modified proteins from D_A.
7. Retrain C_B using the modified proteins from D_B.
8. Return to step 3 if there is an increase in the classification accuracy of C_A and C_B.
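Setting aside the accuracy-based stopping test and the hierarchical update rule, the alternating retraining loop can be sketched as follows; `train` and `score` are abstract placeholders of our own for the one-vs-rest SVM training and scoring steps:

```python
def augment(features, cm):
    # p'_i = (p_i ; Cm_i): append the class-membership scores (Eq. 11)
    return features + list(cm)

def cross_train(D_A, D_B, train, score, rounds=3):
    """Hedged sketch of hierarchical cross-training.

    train(dataset) -> classifiers for that dataset's classes
    score(clfs, x) -> class-membership vector Cm for instance x
    The paper iterates while accuracy improves; a fixed round count is
    used here for simplicity.
    """
    C_A, C_B = train(D_A), train(D_B)
    for _ in range(rounds):
        # augment each dataset with the other taxonomy's SVM scores
        D_B = [augment(p, score(C_A, p)) for p in D_B]
        D_A = [augment(p, score(C_B, p)) for p in D_A]
        C_A, C_B = train(D_A), train(D_B)   # retrain on modified proteins
    return C_A, C_B
```

In a real run, `train` would fit one one-vs-rest SVM per class and `score` would collect the SVM decision values for a protein; here they are left abstract so the control flow stands out.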
The above information is incorporated in the protein update rule. Further, it needs to be established when a given protein belongs to a particular class c with "high probability". One simple way of estimating this is to calculate the class-membership vector Cm_p for a given protein p by testing it with the SVM classifier for each class. The class with the maximum positive value in Cm_p is defined as the only class to which the protein p belongs with "high probability". This method is, however, naive and would miss the correct class when more than two classes have high and close positive values. Also, during experimentation it was found that in many instances the entire Cm_p vector is negative and hence no single positive value exists. A softer version was therefore developed to replace the cross-training update rule, in which Cm_p is rescaled and then the above two rules are used to update the membership values of the ancestor classes.
Subroutine to update Protein Vector:
I/P: protein p; O/P: updated protein
Calculate the class-membership vector Cm_p.
Rescaling step: find the maximum class-membership value val_max and add (1 − val_max) to each element of the vector. This step ensures a positive value for at least one class.
Identifying high-probability classes: find all classes C_p for which the class-membership value is positive.
for every class c ∈ C_p {
   Let the class-membership value of c be val_c.
   Find the ancestor classes Ansc_c.
   Updating ancestor classes: increase the class-membership of each class in Ansc_c by val_c.
}
end Subroutine
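The subroutine above maps to a short function. Class-membership is represented here as a dict from class name to SVM score, with ancestor sets supplied by the caller; this representation is our own choice for the sketch:

```python
def update_membership(cm, ancestors):
    """Soft update rule: rescale Cm so at least one class is positive,
    then add each positive class's value to its ancestors (Rules 1-2).

    cm:        dict mapping class -> SVM score (class-membership)
    ancestors: dict mapping class -> set of ancestor classes
    """
    shift = 1 - max(cm.values())            # rescaling step
    cm = {c: v + shift for c, v in cm.items()}
    updated = dict(cm)
    for c, val in cm.items():
        if val > 0:                          # high-probability classes only
            for a in ancestors.get(c, ()):
                updated[a] = updated.get(a, 0) + val
    return updated
```

Note that a protein whose scores are all negative still ends up with exactly one positive class after rescaling, which is the point of the soft version.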
Extracting relationships using the decision tree
The decision tree [49] algorithm induces a series of comparisons in the form of a binary tree, where each non-leaf node is expressed as a comparison of a feature f_i (classes from taxonomy A) value with a constant value. The comparison decides whether to go to the left or the right subtree. The leaf nodes are the classes to which the instance can belong (classes from taxonomy B). Hence, if we know a protein's membership in one taxonomy, it can be used to find its class in the other taxonomy. The advantage of this approach is that the protein is not required to belong to only a single class, and the user can input the strength for each class. A probabilistic weighted score is generated based on the decision tree. We employed the decision tree algorithm to find the probability that a protein belonging to a class in SCOP belongs to a given class in PROSITE, and vice versa. This created a probability map from SCOP to PROSITE, and vice versa, linking all the classes in either taxonomy to each other with a probabilistic weight. Since PROSITE is a functional classification scheme and SCOP is a structural classification scheme, by corollary, the above probabilistic map can be construed as a probabilistic map between the functional and structural properties of proteins.
Subroutine to create decision tree:
A and B are taxonomies.
D_A = dataset for A after full cross-training with B.
Calculate the class-membership vector Cm_i ∀ p_i ∈ D_A using the classes in B.
Represent every protein p_i in A using Cm_i. Call this dataset D_A^dt.
Train a decision tree DT_A using the dataset D_A^dt.
Repeat the above steps for B to get decision tree DT_B.
Each path in DT_A is a rule: classes-in-B → class-in-A.
Each path in DT_B is a rule: classes-in-A → class-in-B.
end Subroutine
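Path extraction from a trained tree can be sketched with a minimal nested-tuple tree representation (hypothetical, not the actual decision-tree package used): every root-to-leaf path becomes one classes-in-B → class-in-A rule.

```python
def extract_rules(node, path=()):
    """Walk a decision tree and return one rule per root-to-leaf path.

    Internal nodes are (feature, threshold, left, right) tuples, where the
    feature is a class-membership dimension from the other taxonomy; leaves
    are class labels. Each rule is (conditions, predicted_class).
    """
    if not isinstance(node, tuple):          # leaf: emit the finished rule
        return [(path, node)]
    feature, thr, left, right = node
    return (extract_rules(left, path + ((feature, '<=', thr),)) +
            extract_rules(right, path + ((feature, '>', thr),)))
```

Applied to DT_A, the conditions of each rule are thresholds on B-class membership scores and the leaf is the predicted A-class, which is exactly the probabilistic map described above once path weights are attached.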