Robust hierarchical feature selection with a capped ℓ2-norm
Xinxin Liu a,b,c, Hong Zhao a,b,*

a School of Computer Science in Minnan Normal University, Zhangzhou, Fujian 363000, China
b Key Laboratory of Data Science and Intelligence Application, Fujian Province University, Zhangzhou, Fujian 363000, China
c Fujian Key Laboratory of Granular Computing and Application (Minnan Normal University), Zhangzhou, Fujian 363000, China
Article info
Article history:
Received 11 June 2020
Revised 20 February 2021
Accepted 2 March 2021
Available online 10 March 2021
Keywords:
Inter-level error propagation
Capped ‘ 2 -norm
Data outliers
Feature selection
Hierarchical classification
Abstract

Feature selection methods face new challenges in large-scale classification tasks because massive categories are managed in a hierarchical structure. Hierarchical feature selection can take full advantage of the dependencies among hierarchically structured classes. However, most of the existing hierarchical feature selection methods are not robust for dealing with the inevitable data outliers, resulting in a serious inter-level error propagation problem in the following classification process. In this paper, we propose a robust hierarchical feature selection method with a capped ℓ2-norm (HFSCN), which can reduce the adverse effects of data outliers and learn relatively robust and discriminative feature subsets for the hierarchical classification process. Firstly, a large-scale global classification task is split into several small local sub-classification tasks according to the hierarchical class structure and the divide-and-conquer strategy, which makes feature selection modeling easy. Secondly, a capped ℓ2-norm based loss function is used in the feature selection process of each local sub-classification task to eliminate the data outliers, which can alleviate the negative effects of outliers and improve the robustness of the learned feature weighted matrix. Finally, an inter-level relation constraint based on the similarity between the parent and child classes is added to the feature selection model, which can enhance the discriminative ability of the selected feature subset for each sub-classification task with the learned robust feature weighted matrix. Compared with seven traditional and state-of-the-art hierarchical feature selection methods, the superior performance of HFSCN is verified on 16 real and synthetic datasets.
© 2021 Elsevier B.V. All rights reserved.
1 Introduction
In this era of rapid information development, the scale of data in many domains increases dramatically. Meanwhile, such data are often vulnerable to outliers, which usually decrease the density of valuable data for a specific task. These problems are challenging to machine learning and data mining tasks such as classification.
On the one hand, high-dimensional data bring the curse of dimensionality, and feature selection is considered an effective technique to alleviate this problem [9,10]. This method focuses on the features that relate to the classification task and excludes the irrelevant and redundant ones.
On the other hand, data outliers usually disturb the learning models and reduce the relevance between the selected features and the corresponding classes. This may lead to serious inter-level error propagation, particularly in the following hierarchical classification process. Thus, how to deal with data outliers and how to exploit the hierarchical information of classes in feature selection processes is an interesting challenge.
Feature selection methods can be categorized into flat feature selection and hierarchical feature selection methods depending on whether the class hierarchy is considered. The flat feature selection method selects one feature subset to distinguish all the classes. Thus far, many flat feature selection methods based on different criteria have been proposed. For example, the classical feature selection method Relief is based on statistical methods; it selects a relevant feature subset by statistical analysis and uses few heuristics to avoid a complex heuristic search. mRMR is based on the mutual information measure; it selects a feature subset based on the criteria of maximal dependency and maximal relevance. Other works developed a feature selection method based on feature manifold learning and proposed an effective feature selection method based on the backward
elimination approach for web spam detection. Meanwhile, some flat feature selection methods using different regularization terms have been developed. One representative approach applies joint ℓ2,1-norm minimization on both the loss function and the regularization term to optimize the feature selection process. Lan et al. proposed a method to reduce the effect of data outliers and optimize the flat feature selection process. These flat feature selection methods perform well on selecting feature subsets for a two-class classification or a multi-class classification. However, these methods fail to consider the ubiquitous and crucial information of local class relationships and do not perform well when applied directly to hierarchical classification tasks. This has been verified by a series of experiments.
The hierarchical feature selection method selects several local feature subsets by taking full advantage of the dependency relationships among the hierarchically structured classes. Relying on different feature subsets to discriminate among different classes can help achieve more significant effects in hierarchical classification tasks. For example, texture and color features are suitable for distinguishing among different animals, while the edge feature is more appropriate for discriminating among various furniture items.
Some researchers combined the process of feature selection and hierarchical classifier design with genetic algorithms to improve the classification performance. Others developed a hierarchical feature selection algorithm based on the fuzzy rough set theory. These methods can achieve high classification accuracy but fail to use the dependencies in the hierarchical structure of the classes.
One line of work proposed a hierarchical feature selection method with three penalty terms: a sparsity term is used to select a compact feature subset; a parent–child constraint is added as a term to select the common features shared by parent and child categories; and an independence constraint is added to maximize the uniqueness between sibling categories. Following that work, they then proposed a recursive regularization based hierarchical feature selection method with the parent–child relationship constraint and the sibling relationship constraint. Another study focused on the two-way dependence among different classes and proposed a hierarchical feature selection with subtree-based graph regularization. These methods have a good performance in selecting feature subsets for large-scale classification tasks with hierarchical class structures. However, these existing hierarchical feature selection methods are not robust to data outliers and suffer from a serious inter-level error propagation problem. There is no outlier filtering mechanism in these models, and the commonly used least-squares loss function squares the misclassification loss of these outliers, which will further aggravate the negative impacts of these outliers. This makes these models achieve relatively low performance when dealing with practical tasks with ubiquitous outliers.
In this paper, we propose a robust hierarchical feature selection method with a capped ℓ2-norm (HFSCN), which reduces the adverse effects of data outliers and selects unique and compact local feature subsets to control the inter-level error propagation in the following classification process. Firstly, HFSCN decomposes a complex large-scale classification task into several simple sub-classification tasks according to the hierarchical structure of classes and the divide-and-conquer strategy. Compared with the initial classification task, these sub-classification tasks are small-scale and easy to handle, and only discriminative features need to be retained for the current local sub-classification task. Secondly, HFSCN excludes the data outliers with a capped ℓ2-norm based loss function according to the regression analysis. In contrast to the existing hierarchical feature selection methods, HFSCN can improve the robustness of the selected local feature subsets and alleviate the error propagation problem in the classification process. Finally, HFSCN selects a unique and compact feature subset for the current sub-classification task using an inter-level regularization of the parent–child relationship in the feature selection process of the current sub-classification task. The dependency between the current child sub-classification task and its parent sub-classification task is emphasized to drop out the features related to the local sub-classification tasks sharing different parent classes with the current sub-classification task.
A series of experiments are conducted to compare HFSCN with seven of the existing hierarchical feature selection methods. The experimental datasets consist of two protein sequence datasets, two image datasets, and their 12 corrupted datasets with three types of sample noise. Six evaluation metrics are used to discuss the significant differences between our method and the compared methods. The experimental results demonstrate that the feature subsets selected by the proposed HFSCN algorithm are superior to those selected by the compared methods for the classification tasks with hierarchical class structures.
The rest of this paper is organized as follows. We present the basic knowledge of hierarchical classification and feature selection and describe the modeling process of HFSCN in Section 2. Section 3 introduces the experimental datasets, the compared methods, the parameter settings, and some evaluation metrics. Section 4 presents and discusses the performance of the compared methods. Finally, Section 5 concludes this paper and gives ideas for further study.
2 HFSCN method
In this section, we present the proposed robust hierarchical feature selection method HFSCN.

2.1 Framework of the HFSCN method

There are two motivations to design our robust hierarchical feature selection method. Firstly, the hierarchical class structure in the large-scale classification task has to be taken into account for the prevailing hierarchical management of numerous classes. Secondly, the adverse effects of noises such as data outliers, which may result in a serious inter-level error problem in the following hierarchical classification, have to be reduced in the optimization process.
A framework of HFSCN based on these considerations is shown in Fig. 1. The process of HFSCN can be roughly decomposed into the following two steps:

(1) Divide a complex large-scale classification task into a group of small sub-classification tasks according to the divide-and-conquer strategy and the class's hierarchical information.
(2) Develop a robust hierarchical feature selection for each sub-classification task, considering the elimination of the outliers and the inter-level relation constraints.
2.2 Hierarchical classification
In most real-world and practical large-scale classification tasks, categories are usually managed in a hierarchical structure. A tree structure and a directed acyclic graph structure are two common representations of the class hierarchical information. In this study, we focus on the classes with a hierarchical tree structure.
The hierarchical tree structure of classes is usually defined by the "IS-A" relationship "$\prec$" between a class and its parent class, where the former is the subclass of the latter, and $C_T$ denotes the class set organized in the tree. The "IS-A" relationship has the following properties:
– the root class node is the only greatest element in the tree structure of classes;
– $\forall c_i, c_j \in C_T$, if $c_i \prec c_j$, then $c_j \nprec c_i$;
– $\forall c_i \in C_T$, $c_i \nprec c_i$;
– $\forall c_i, c_j, c_k \in C_T$, if $c_i \prec c_j$ and $c_j \prec c_k$, then $c_i \prec c_k$.
The asymmetry property means that if the former class is a subclass of the latter, the latter is not a subclass of the former; that is, the former class is not a child of itself in reverse. An example of object classes with hierarchical information represented by a tree structure is shown in Fig. 2. The root class Object, which contains all of the classes below it, is the unique top-level node. There are several internal class nodes, which have parent coarse-grained class nodes and child fine-grained class nodes. For instance, the Furniture class has the child class set of Seating and Dining table. Class nodes without a child node are termed "leaf class nodes". The root node and all of the internal nodes are called "non-leaf class nodes". Moreover, the classification process of all the samples stops at a leaf node in the experiments; i.e., leaf node classification is mandatory. Several examples in Fig. 2 illustrate the properties of "IS-A". (1) The asymmetry property: Sofa is a type of Seating, but it is incorrect that all seating are Sofa. (2) The transitivity property implies that Chair belongs to Seating and Seating belongs to Furniture, so Chair belongs to Furniture as well. In this manner, the class hierarchies in all hierarchical classification tasks satisfy the four properties mentioned above.
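To make the tree structure and the "IS-A" properties concrete, the following Python sketch (illustrative only, not part of the paper's released code; the class names are a small hypothetical subset of the Fig. 2 hierarchy) stores the hierarchy as a child-to-parent map and checks the asymmetry, anti-reflexivity, and transitivity properties.

```python
# Minimal sketch: a class tree stored as a child -> parent map, with the
# transitive "IS-A" test and ancestor lookup used throughout hierarchical
# classification. Class names follow the VOC example of Fig. 2.
parent = {
    "Furniture": "Object", "Seating": "Furniture", "Dining table": "Furniture",
    "Sofa": "Seating", "Chair": "Seating",          # hypothetical subset of Fig. 2
}

def ancestors(c):
    """All coarser-grained classes above c (excluding c itself)."""
    out = []
    while c in parent:
        c = parent[c]
        out.append(c)
    return out

def is_a(ci, cj):
    """True if ci is a (transitive) subclass of cj; asymmetric and anti-reflexive."""
    return cj in ancestors(ci)

print(is_a("Chair", "Furniture"))   # True  (transitivity)
print(is_a("Seating", "Sofa"))      # False (asymmetry)
print(is_a("Sofa", "Sofa"))         # False (anti-reflexivity)
```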
Fig. 1. Framework of HFSCN. Firstly, a complex large-scale classification task is divided into some sub-classification tasks with small scales and different inputs and outputs according to the divide-and-conquer strategy and the hierarchy of the classes. Then, a corresponding training dataset is grouped for these subtasks from bottom to top along the hierarchy of classes. Finally, robust and discriminative feature subsets are selected recursively for those subtasks by the capped ℓ2-norm based noise filtering mechanism and the relation constraint between the parent class and its child classes.

Fig. 2. Hierarchical tree structure of object classes of the VOC dataset [30].

A classification task with object classes managed in a hierarchical tree structure is called hierarchical classification. A sample is classified into one class node at each level in turn in a coarse-to-fine fashion. In the hierarchical tree structure of classes, the root and internal nodes are abstract coarse-grained categories
summarized from their child classes. The class labels of the training samples correspond to the leaf classes, which are fine-grained categories. The closer a node is to the root node, the coarser the granularity of the category. As a result, a sample belongs to several categories from the coarse-grained level to the fine-grained level. However, most hierarchical classification methods have one general and serious problem, called inter-level error propagation; the classification errors at the parent class are easily transported to its child classes and propagated to the leaf classes along with the top-down classification process.
2.3 Decomposition of a complex classification task
A divide-and-conquer strategy is used to divide a complex classification task with a hierarchical class structure. A group of small sub-classification tasks corresponding to the non-leaf classes in the hierarchical class structure can be obtained according to the decomposition process. The fine-grained classes under the child classes of a non-leaf class are ignored, and only the direct child classes of this non-leaf node are included in the searching space for the corresponding sub-classification task. For example, the classification task of VOC is divided into several small sub-classification tasks in Fig. 3, where the root node is denoted as class 0. For the sub-classification task corresponding to the non-leaf class 0, we only need to distinguish its four direct child classes (Vehicles, Animal, Household, and Person), and do not discriminate the fine-grained classes under Vehicles, Animal, and Household according to the above task decomposition process.
Therefore, each sub-classification task's searching space is significantly decreased, which makes it simple to model the feature selection and the classification process.
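As an illustration of this divide-and-conquer decomposition, the hypothetical sketch below groups the direct children of every non-leaf class into one sub-classification task; the helper name and the small class map are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch: build one sub-classification task per non-leaf class; each
# subtask only needs to separate the direct children of that class.
from collections import defaultdict

def decompose(parent):
    """parent: child -> parent map of the class tree.
    Returns {non_leaf_class: [direct child classes]}."""
    children = defaultdict(list)
    for child, par in parent.items():
        children[par].append(child)
    return dict(children)

parent = {"Vehicles": "Object", "Animal": "Object", "Household": "Object",
          "Person": "Object", "Furniture": "Household", "Chair": "Furniture"}
subtasks = decompose(parent)
# The subtask for the root "Object" distinguishes only its four direct children,
# ignoring the finer-grained classes below them.
print(subtasks["Object"])   # ['Vehicles', 'Animal', 'Household', 'Person']
```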
Meanwhile, the classification task is represented according to the decomposed sub-classification tasks. Let $C_{\max}$ denote the maximum number of child nodes in all the sub-classification tasks to facilitate the calculation. The class label matrix set is redefined as follows: $Y = \{Y_0, \ldots, Y_i, \ldots\}$, where $Y_i \in \{0, 1\}^{m_i \times C_{\max}}$.

2.4 Robust hierarchical feature selection method

One unique feature selection process for each small sub-classification task is obtained by the aforementioned task decomposition, and each sub-classification task has its own training data, including the feature matrix and the class label matrix.
Not all of the dimensions of object features are suitable for predicting a specific category. We select a unique and compact feature subset and drop out the relatively irrelevant features for each sub-classification task to alleviate the curse of dimensionality problem. Feature selection methods can be categorized into three groups according to different selection strategies: filter feature selection, wrapper feature selection, and embedding feature selection. In this study, we focus on the third one, namely embedding feature selection. Different norm-based regularization terms of the feature weighted matrix W are usually used as penalty terms in the embedding feature selection.
The feature selection model for the i-th sub-classification task can be formulated as a common penalized optimization problem
as follows:
$$\min_{W_i} \; L(X_i, Y_i; W_i) + \lambda R(W_i). \tag{1}$$
Fig. 3. Classification task of VOC is divided into several small sub-classification tasks.
The common and traditional empirical loss functions include the least squares loss and the logistic loss. Assume that the classical least squares loss $\|X_i W_i - Y_i\|_F^2$ is used in the feature selection model for the i-th sub-classification task, where $X_i$ and $Y_i$ are the feature matrix and the class label matrix of this subtask. For the data outliers existing in the training set, the classification losses are particularly large, because the squared residual $\|x_i^j W_i - y_i^j\|^2$ of an outlier is further amplified by the square operation. This leads to a serious inter-level error propagation problem in the following hierarchical classification process. An outlier is a case that does not follow the same model as the rest of the data and appears as if it were generated by a different mechanism.
In order to reduce the adverse effects of data outliers, we use a capped ℓ2-norm based loss function to remove the outliers according to the regression analysis of the features and the class labels:

$$\min_{W_i} \sum_{j=1}^{m_i} \min\left(\left\|x_i^j W_i - y_i^j\right\|_2, \varepsilon_i\right), \tag{2}$$

where $x_i^j$ and $y_i^j$ denote the feature vector and the class label vector of the j-th sample, and $\varepsilon_i$ is the outlier threshold for the i-th sub-classification task. No matter how serious the classification error caused by a data point is, the classification loss of this data point is at most $\varepsilon_i$. Thus, the negative effects of data outliers on the learned feature weighted matrix are considerably reduced to obtain a robust and discriminative feature subset for the i-th sub-classification task.
We take the i-th sub-classification task as an example to illustrate the outlier filtering process. Firstly, the losses of all the training samples are calculated for the i-th sub-classification task. For example, the classification losses of 20 samples are calculated according to the loss function. Then, the obtained 20 losses are sorted, and the following descending-order sequence is obtained: [92.1, 87.4, 50.3, 29.9, ...]. If the outlier percentage is set to 10%, the two samples with the largest losses are regarded as outliers, and the classification losses for these data outliers are limited to 50.3 at most; these serious data outliers are eliminated by the capped ℓ2-norm based loss function, which can improve the discriminative ability of the selected features and decrease the inter-level errors in the hierarchical classification process.
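The outlier filtering described above can be sketched as follows; the matrices, the threshold value, and the helper name are hypothetical placeholders used only to illustrate the per-sample capped ℓ2-norm loss of Eq. (2). Note that the paper derives the per-task threshold from the outlier percentage, whereas the sketch takes the threshold directly.

```python
import numpy as np

# Minimal sketch of the capped l2-norm loss in Eq. (2); X, Y, W and eps are
# hypothetical placeholders, not values from the paper.
def capped_l2_losses(X, Y, W, eps):
    """Per-sample regression losses min(||x_j W - y_j||_2, eps)."""
    residual_norms = np.linalg.norm(X @ W - Y, axis=1)   # ||x_j W - y_j||_2
    return np.minimum(residual_norms, eps)

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(20, 5)), rng.normal(size=(20, 3))
W = rng.normal(size=(5, 3))
losses = capped_l2_losses(X, Y, W, eps=2.0)
# Samples whose raw loss exceeds eps contribute only eps, so extreme outliers
# cannot dominate the learned feature weighted matrix.
print(np.sort(losses)[::-1])
```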
The following are two penalty terms used in our feature selection model: a structural sparsity term on the feature weighted matrix, and an inter-level relation term based on the similarity between the coarse-grained parent class and the fine-grained child class.
An ℓ2,1-norm regularization term $\|W_i\|_{2,1}$ enables the model to select a compact and unique local feature subset for discriminating the classes in the current i-th sub-classification task, and to discard features that are suitable for distinguishing the categories in other sub-classification tasks. This penalty term is called the structural sparsity for the classes in the current subtask. For the i-th sub-classification task, the feature selection objective function with the sparsity regularization term is as follows:

$$\min_{W_i} \sum_{j=1}^{m_i} \min\left(\left\|x_i^j W_i - y_i^j\right\|_2, \varepsilon_i\right) + \lambda \|W_i\|_{2,1}, \tag{3}$$
where λ is the sparsity regularization parameter. The ℓ2,1-norm regularization tends to select the features with large weights across all the classes, and the features with small weights are not selected.
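A small sketch of the structural sparsity term may help: it computes $\|W_i\|_{2,1}$ as the sum of row ℓ2-norms and ranks features by those norms, which is also how the final feature subsets are picked in Section 2.5. The matrix sizes are illustrative assumptions.

```python
import numpy as np

# Sketch of the structural sparsity term ||W_i||_{2,1}: the sum of the l2-norms
# of the rows of W_i; features (rows) with large norms are the ones retained.
def l21_norm(W):
    return np.linalg.norm(W, axis=1).sum()

def rank_features(W, top_k):
    """Indices of the top_k features with the largest row l2-norms."""
    row_norms = np.linalg.norm(W, axis=1)
    return np.argsort(row_norms)[::-1][:top_k]

W = np.random.default_rng(1).normal(size=(473, 4))   # e.g. 473 features, 4 child classes
print(l21_norm(W), rank_features(W, top_k=47)[:5])
```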
An inter-level relation regularization term defined according to the parent–child relationship is used to optimize the feature selection process of the i-th sub-classification task. A coarse-grained parent class is abstracted and generalized from the fine-grained child classes under it. The samples in one child class have to first be classified into a coarser-grained category (the parent class) and then into the current class. Therefore, it is reasonable to believe that the selected feature subset for a child sub-classification task is similar to that selected for its parent sub-classification task. The features related to the classes sharing different parent classes with the current class need to be dropped out. The Frobenius norm $\|W_i - W_{p_i}\|_F^2$ between the feature weighted matrix of the current subtask and that of its parent subtask is adopted for the convenience of calculation. This penalty term is called the inter-level constraint and is added to the objective function. Thus, the final objective function for the i-th sub-classification task can be expressed as follows:
$$\min_{W_i} \sum_{j=1}^{m_i} \min\left(\left\|x_i^j W_i - y_i^j\right\|_2, \varepsilon_i\right) + \lambda \|W_i\|_{2,1} + \alpha \left\|W_i - W_{p_i}\right\|_F^2, \tag{4}$$

where $W_{p_i}$ is the feature weighted matrix of the parent sub-classification task of the i-th subtask and α is the paternity regularization coefficient.
Finally, the hierarchical feature selection objective function of the entire hierarchical classification task is written as follows:

$$\min_{W} \sum_{i=0}^{N} \left( \sum_{j=1}^{m_i} \min\left(\left\|x_i^j W_i - y_i^j\right\|_2, \varepsilon_i\right) + \lambda \|W_i\|_{2,1} \right) + \alpha \sum_{i=1}^{N} \left\|W_i - W_{p_i}\right\|_F^2, \tag{5}$$

where the 0-th sub-classification task corresponds to the root class, which has no parent subtask.
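For clarity, the per-subtask objective of Eq. (4) can be evaluated as in the following hedged sketch; all names and parameter values are placeholders rather than the authors' code.

```python
import numpy as np

# Sketch of the per-subtask objective in Eq. (4): capped l2-norm loss,
# l_{2,1} sparsity, and the parent-child (inter-level) Frobenius penalty.
def subtask_objective(X, Y, W, W_parent, lam, alpha, eps):
    loss = np.minimum(np.linalg.norm(X @ W - Y, axis=1), eps).sum()
    sparsity = lam * np.linalg.norm(W, axis=1).sum()          # lam * ||W||_{2,1}
    inter_level = alpha * np.linalg.norm(W - W_parent) ** 2   # alpha * ||W - W_p||_F^2
    return loss + sparsity + inter_level
```

Summing this quantity over all non-root subtasks, plus the root term without the inter-level penalty, gives the global objective of Eq. (5).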
2.5 Optimization of HFSCN
In this section, we describe the optimization process for solving the objective function of HFSCN. Assume diagonal matrices $D_i$ and $S_i$ whose diagonal elements are defined as the following values, respectively:

$$d_i^{jj} = \frac{1}{2\left\|w_i^j\right\|_2}, \tag{6}$$

$$s_i^{jj} = \frac{\mathrm{Ind}\left(\left\|x_i^j W_i - y_i^j\right\|_2 \le \varepsilon_i\right)}{2\left\|x_i^j W_i - y_i^j\right\|_2}, \tag{7}$$

where $w_i^j$ is the j-th row of $W_i$, and $x_i^j$ and $y_i^j$ are the feature vector and the class label vector of the j-th sample in the i-th feature selection subtask. $\mathrm{Ind}(\cdot)$ is an indicator function:

$$\mathrm{Ind}(\cdot) = \begin{cases} 1, & \text{if } \left\|x_i^j W_i - y_i^j\right\|_2 \le \varepsilon_i, \\ 0, & \text{otherwise}. \end{cases} \tag{8}$$
The hierarchical feature selection objective function can be rewritten as:
$$\min_{W} \sum_{i=0}^{N} \left( \mathrm{Tr}\left((X_i W_i - Y_i)^T S_i (X_i W_i - Y_i)\right) + \lambda \, \mathrm{Tr}\left(W_i^T D_i W_i\right) \right) + \alpha \sum_{i=1}^{N} \left\|W_i - W_{p_i}\right\|_F^2. \tag{9}$$
The optimization objective function of the root node (the 0-th sub-classification task) needs to be updated separately because it has no parent class. The objective function of the root node can
be expressed as follows:
$$\min_{W_0, W_i} \mathrm{Tr}\left((X_0 W_0 - Y_0)^T S_0 (X_0 W_0 - Y_0)\right) + \lambda \, \mathrm{Tr}\left(W_0^T D_0 W_0\right) + \alpha \sum_{i=1}^{|C_0|} \mathrm{Tr}\left((W_i - W_0)^T (W_i - W_0)\right), \tag{10}$$

where $|C_0|$ is the number of child sub-classification tasks of the root. Setting the derivative of Eq. (10) with respect to $W_0$ to zero yields

$$2\left(X_0^T S_0 X_0 + \lambda D_0 + \alpha |C_0| I\right) W_0 - 2\left(X_0^T S_0 Y_0 + \alpha \sum_{i=1}^{|C_0|} W_i\right) = 0, \tag{11}$$

$$W_0 = \left(X_0^T S_0 X_0 + \lambda D_0 + \alpha |C_0| I\right)^{-1} \left(X_0^T S_0 Y_0 + \alpha \sum_{i=1}^{|C_0|} W_i\right). \tag{12}$$
Then, the following are the optimization processes of the other sub-classification tasks that have parent classes. The feature selection objective function of the i-th sub-classification task can be expressed as follows:

$$\min_{W_i, W_{p_i}} \mathrm{Tr}\left((X_i W_i - Y_i)^T S_i (X_i W_i - Y_i)\right) + \lambda \, \mathrm{Tr}\left(W_i^T D_i W_i\right) + \alpha \, \mathrm{Tr}\left((W_i - W_{p_i})^T (W_i - W_{p_i})\right). \tag{13}$$

Setting the derivative of Eq. (13) with respect to $W_i$ to zero gives the update as follows:

$$2\left(X_i^T S_i X_i + \lambda D_i + \alpha I\right) W_i - 2\left(X_i^T S_i Y_i + \alpha W_{p_i}\right) = 0, \tag{14}$$

$$W_i = \left(X_i^T S_i X_i + \lambda D_i + \alpha I\right)^{-1} \left(X_i^T S_i Y_i + \alpha W_{p_i}\right). \tag{15}$$
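A minimal numerical sketch of one update step may clarify Eqs. (6)-(8) and (15); the small delta that guards against division by zero is our assumption and not part of the derivation, and the threshold eps is passed directly rather than derived from the outlier percentage.

```python
import numpy as np

# One-iteration sketch of the non-root update (Eqs. (6)-(8) and (15)).
def update_Wi(X, Y, W, W_parent, lam, alpha, eps, delta=1e-8):
    d = W.shape[0]
    D = np.diag(1.0 / (2.0 * np.linalg.norm(W, axis=1) + delta))      # Eq. (6)
    r = np.linalg.norm(X @ W - Y, axis=1)                             # ||x_j W - y_j||_2
    S = np.diag(np.where(r <= eps, 1.0, 0.0) / (2.0 * r + delta))     # Eqs. (7)-(8)
    A = X.T @ S @ X + lam * D + alpha * np.eye(d)
    b = X.T @ S @ Y + alpha * W_parent
    return np.linalg.solve(A, b)                                      # Eq. (15)
```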
Algorithm 1: Robust hierarchical feature selection with a capped ℓ2-norm (HFSCN).

Input:
(1) Data matrices $X_i \in \mathbb{R}^{m_i \times d}$, $Y_i \in \mathbb{R}^{m_i \times C_{\max}}$, and $C_T$;
(2) Sparsity regularization parameter λ > 0;
(3) Paternity regularization coefficient α > 0;
(4) Outlier percentage ε ≥ 0.
Output:
(1) Feature weighted matrix set $W = \{W_0, W_1, \ldots, W_N\}$, where $W_i \in \mathbb{R}^{d \times C_{\max}}$;
(2) Selected feature subsets $F = \{F_0, F_1, \ldots, F_N\}$.
1: Set iteration number t = 1; initialize $W^{(t)}$ with each element of $W_i^{(t)}$ set to 1;
2: repeat
3:   for i = 0 to N do
4:     Compute $D_i^{(t)}$ and $S_i^{(t)}$ according to Eqs. (6) and (7), respectively;
5:   end for
     // Update the feature weighted matrix for the 0-th sub-classification task
6:   Update $W_0$: $W_0^{(t+1)} = \left(X_0^T S_0^{(t)} X_0 + \lambda D_0^{(t)} + \alpha |C_0| I\right)^{-1}\left(X_0^T S_0^{(t)} Y_0 + \alpha \sum_{i=1}^{|C_0|} W_i^{(t)}\right)$;
7:   for i = 1 to N do
       // Update the feature weighted matrix for the i-th sub-classification task
8:     Update $W_i$: $W_i^{(t+1)} = \left(X_i^T S_i X_i + \lambda D_i + \alpha I\right)^{-1}\left(X_i^T S_i Y_i + \alpha W_{p_i}\right)$;
9:   end for
10:  Update $W^{(t+1)} = \left\{W_0^{(t+1)}, W_1^{(t+1)}, \ldots, W_N^{(t+1)}\right\}$;
11:  t = t + 1;
12: until convergence of the objective function in Eq. (5)
13: for i = 0 to N do
14:   Rank $\|w_i^j\|_2$ (j = 1, ..., n) in descending order for the i-th sub-classification task;
15:   Select the top ranked feature subset $F_i$ for the i-th sub-classification task;
16: end for
The feature selection process of the initial hierarchical classification task is completed when the feature weighted matrices are obtained for the whole global hierarchical classification task. The n features for the i-th sub-classification task are sorted according to $\|w_i^j\|_2$, and the top-ranked features are selected.

Based on the above analysis, the detailed algorithm to solve the objective function of HFSCN is summarized in Algorithm 1. Firstly, the data outliers are eliminated according to the least squares regression analysis and the capped threshold. The feature weighted matrices are then calculated according to the sparse regularization and the parent–child relationship constraint, as listed from Line 6 to Line 10. The iteration of the two aforementioned processes continues until the convergence of the objective function. Finally, the local feature subsets are selected for the hierarchical classification task through the processes in Lines 13 and 14. It takes approximately six iterations before convergence in the experiments.
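Putting the pieces together, the following self-contained sketch mirrors the structure of Algorithm 1 (all-ones initialization, alternating $D$/$S$ and $W$ updates, and feature ranking); the task-list data structure, the fixed iteration count, and the direct use of eps as a loss threshold are assumptions for illustration, not the released implementation.

```python
import numpy as np

# Compact sketch of Algorithm 1. "tasks" is a hypothetical list of dicts
# {"X": ..., "Y": ..., "parent": index or None}; lam, alpha, eps correspond
# to the regularization and outlier parameters in Eq. (5).
def hfscn(tasks, lam, alpha, eps, n_iter=6, delta=1e-8):
    W = [np.ones((t["X"].shape[1], t["Y"].shape[1])) for t in tasks]   # Line 1: all-ones init
    for _ in range(n_iter):                                            # ~6 iterations suffice
        for i, t in enumerate(tasks):
            X, Y, p = t["X"], t["Y"], t["parent"]
            D = np.diag(1.0 / (2.0 * np.linalg.norm(W[i], axis=1) + delta))       # Eq. (6)
            r = np.linalg.norm(X @ W[i] - Y, axis=1)
            S = np.diag(np.where(r <= eps, 1.0, 0.0) / (2.0 * r + delta))          # Eqs. (7)-(8)
            if p is None:                                               # root update, Eq. (12)
                kids = [j for j, s in enumerate(tasks) if s["parent"] == i]
                A = X.T @ S @ X + lam * D + alpha * len(kids) * np.eye(X.shape[1])
                b = X.T @ S @ Y + alpha * sum(W[j] for j in kids)
            else:                                                       # non-root update, Eq. (15)
                A = X.T @ S @ X + lam * D + alpha * np.eye(X.shape[1])
                b = X.T @ S @ Y + alpha * W[p]
            W[i] = np.linalg.solve(A, b)
    # Lines 13-15: rank features of each subtask by the row l2-norms of W_i
    return W, [np.argsort(np.linalg.norm(Wi, axis=1))[::-1] for Wi in W]
```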
3 Experiment setup

In this section, we introduce the experiment setup from the following four aspects: (1) the real and synthetic experimental datasets; (2) the compared methods; (3) the evaluation metrics used to discuss the performance of all methods; and (4) the parameter settings in the experiments.
All experiments are conducted on a PC with a 3.40 GHz Intel Core i7-3770 CPU, 16 GB memory, and the Windows 7 operating system. The source code of HFSCN is available at https://github.com/fhqxa/HFSCN.
3.1 Datasets

Experiments are conducted on four practical datasets (including two protein sequence datasets, one object image dataset, and one fine-grained car image dataset) from the machine learning repository and their 12 corrupted synthetic datasets. These datasets from different application scenarios provide a good test ground for evaluating the performance of different hierarchical feature selection methods. Detailed information about the initial datasets is as follows:
(1) DD and F194 are two protein sequence datasets from Bioinformatics. Their samples are described by 473 features. Two levels of the protein hierarchy, namely class and fold, are used in the experiment. The DD dataset contains 27 various protein folds, which originate from four major structural classes. The F194 dataset contains protein sequences from 194 folds which belong to seven classes. The class hierarchies of DD and F194 are shown in Fig. 4 and Fig. 5, respectively.
(2) VOC is a benchmark dataset in visual object category recognition and detection. It contains 12,283 annotated consumer photographs collected from the Flickr photo-sharing website, where 1,000 features are extracted to characterize each image sample. Its class hierarchy is shown in Fig. 2, and the leaf nodes correspond to the 20 fine-grained object categories.
(3) Car196 is a fine-grained image dataset of cars. It consists of 196 classes and 15,685 images, covering sedans, SUVs, coupes, convertibles, pickups, hatchbacks, and station wagons. Each image sample is characterized by 4,096 features. Fig. 6 shows the hierarchical class structure of Car196, and the leaf nodes correspond to the 196 fine-grained car classes.
For the corrupted datasets, some sample outliers are generated according to different distributions and then included in the initial training datasets. The quantity of sample outliers in each corrupted training set is 10% of the number of samples in the corresponding initial training set. But the test sets in all the corrupted datasets are the same as those in all the initial datasets and are not corrupted with data outliers. In addition, the feature dimensionality and hierarchical structures of the classes are not changed. Three common distributions are used to obtain three new synthetic datasets for each initial dataset. The detailed information about these 16 initial and corrupted datasets is listed in Table 2.
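The corruption procedure can be sketched as below; the helper name, the random seed, and the way the outliers are appended are assumptions, while the three noise distributions follow the descriptions in Table 2.

```python
import numpy as np

# Sketch of how a corrupted training set can be generated: extra outlier
# samples equal to 10% of the training size are appended to the feature matrix.
def corrupt(X_train, kind, seed=0):
    rng = np.random.default_rng(seed)
    n_out, d = int(0.10 * X_train.shape[0]), X_train.shape[1]
    if kind == "random":          # random noise in [0, 10]
        noise = rng.uniform(0.0, 10.0, size=(n_out, d))
    elif kind == "gaussian":      # Gaussian distribution N(0, 1)
        noise = rng.normal(0.0, 1.0, size=(n_out, d))
    else:                         # uniform distribution U(0, 1)
        noise = rng.uniform(0.0, 1.0, size=(n_out, d))
    return np.vstack([X_train, noise])
```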
3.2 Compared methods
HFSCN is compared with the baselines and seven hierarchical feature selection methods in the hierarchical classification tasks to evaluate the effectiveness and the efficiency of HFSCN for hierarchical feature selection. Top-down SVM classifiers are used in the classification experiments. The compared methods are as follows:

(1) Baselines, including F-Baseline and H-Baseline, classify the test samples with all the initial features. F-Baseline assumes that the categories are independent of each other and directly distinguishes among all the categories. Different from F-Baseline, H-Baseline considers the semantic hierarchy between the categories and classifies the samples from coarse to fine.
(2) HRelief is a hierarchical feature selection method that is a hierarchical variant of the classical Relief algorithm, which selects features based on instance-based learning.
(3) HFisher uses the classical feature selection method Fisher score in the class hierarchy, where the Fisher score selects the features of the labeled training data on the basis of the criterion of the best discriminatory ability and can avoid the process of heuristic search.
(4) HFSNM is a hierarchical feature selection method that selects features with joint ℓ2,1-norm minimization on both the loss function and the regularization term on all the data points. In this method, a sparsity regularization parameter needs to be predefined.
(5) HmRMR applies mRMR in the class hierarchy; the classical method mRMR selects a feature subset based on the minimal-redundancy and maximal-relevance criteria and avoids the difficult multivariate density estimation in maximizing dependency.
(6) HFSRR-PS is a recursive regularization based hierarchical feature selection method that uses both the parent–child and sibling relationships in the class hierarchy to optimize the feature selection process.
(7) HFSRR-P is similar to HFSRR-PS but uses the parent–child relationship only during the hierarchical feature selection process.
(8) HFS-O is a hierarchical feature selection method modified from a robust flat feature selection method. HFS-O uses the sparsity regularization to obtain the feature subsets and an outlier filtering loss function to filter out the data outliers. The parent–child and sibling relationships are not taken into consideration in its feature selection process.
There are three methods, HFSRR-PS, HFSRR-P, and HFS-O, closely related to HFSCN. Table 3 lists the differences among these three compared methods and HFSCN in terms of the outlier filtering loss function, the parent–child relationship constraint, and the sibling relationship constraint. We compare HFSCN with HFSRR-PS, HFSRR-P, and HFS-O as follows:

(1) Compared with HFSRR-PS and HFSRR-P, HFSCN is robust in dealing with the inevitable data outliers. The outlier filtering loss function can improve the discriminative ability of the selected local feature subset and further alleviate the error propagation problem in the hierarchical classification process.
(2) The sibling relationship constraint between classes is not used in HFSCN given its relatively minor performance improvement in HFSRR-PS compared with HFSRR-P and the model complexity of HFSCN.
(3) In contrast to HFS-O, HFSCN uses the similarity constraint between the parent and child classes to select a unique and compact local feature subset recursively for a sub-classification task. However, the relation among classes is not taken into consideration in HFS-O, and the local feature subsets are selected independently.

Table 1. Initial dataset description.

Fig. 4. Hierarchical tree structure of the object classes in DD.
3.3 Evaluation metrics
Six evaluation metrics are used in the experiments to evaluate the performance of our method and the compared methods. On the one hand, the running time of the feature selection process and the classification time for testing the selected feature subsets are evaluated for these methods. On the other hand, the effectiveness of the selected feature subsets for classification is evaluated in terms of the classification accuracy, the hierarchical F1-measure, the tree induced error, and the F-measure based on the least common ancestor. These metrics are described as follows:

(1) Feature selection time. This metric measures the running time of selecting feature subsets by the different feature selection algorithms. F-Baseline and H-Baseline input all the initial features directly in the classification without the feature selection process. Therefore, only the proposed method and the other seven hierarchical feature selection methods are compared based on this evaluation metric.
(2) Classification time. This metric measures the running time of the classifier learning and prediction process on the test set with the selected features.
(3) Classification accuracy. This is the simplest evaluation metric for flat or hierarchical classification. It is calculated as the ratio of the number of correctly predicted samples to the total number of test samples.
(4) Hierarchical F1-measure ($F_H$). The evaluation metrics for hierarchical classification models include the hierarchical precision $P_H$, the hierarchical recall $R_H$, and the hierarchical F1-measure $F_H$:

$$P_H = \frac{|\hat{D}_{aug} \cap D_{aug}|}{|\hat{D}_{aug}|}, \quad R_H = \frac{|\hat{D}_{aug} \cap D_{aug}|}{|D_{aug}|}, \quad F_H = \frac{2 \cdot P_H \cdot R_H}{P_H + R_H},$$

where $D_{aug} = D \cup \mathrm{An}(D)$ and $\hat{D}_{aug} = \hat{D} \cup \mathrm{An}(\hat{D})$; $D$ is the real label of the test sample, $\mathrm{An}(D)$ represents the parent node set of $D$, and $\hat{D}$ indicates the predicted class label of the test sample.
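A short sketch of how $P_H$, $R_H$, and $F_H$ can be computed for a single test sample follows; the ancestors lookup is the hypothetical helper sketched in Section 2.2, and conventions on whether the root class is counted in the augmented sets may differ from the paper.

```python
# Sketch of the hierarchical precision/recall/F1 for one test sample: augment
# the real and predicted leaf labels with their ancestors and compare the sets.
# Whether the root class is included in the ancestor sets is a convention detail.
def hierarchical_prf(pred_leaf, true_leaf, ancestors):
    D_aug = {true_leaf, *ancestors(true_leaf)}
    D_hat = {pred_leaf, *ancestors(pred_leaf)}
    overlap = len(D_aug & D_hat)
    P_H, R_H = overlap / len(D_hat), overlap / len(D_aug)
    F_H = 2 * P_H * R_H / (P_H + R_H) if P_H + R_H else 0.0
    return P_H, R_H, F_H
```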
Fig. 5. Hierarchical tree structure of object classes in F194.

Fig. 6. Hierarchical tree structure of object classes in Car196.
(5) Tree induced error (TIE). In hierarchical classification, we should give different punishments to different types of classification errors. In the proposed model, the penalty is defined by the distance of nodes in the class hierarchy, which is termed the tree induced error. For a predicted class $c_p$ and a real class $c_r$, $TIE(c_p, c_r)$ is the number of edges along the path from $c_p$ to $c_r$ in the class tree.
(6) F-measure based on the least common ancestor ($F_{LCA}$). This metric is based on but is different from the hierarchical F1-measure. The least common ancestor (LCA) is derived from graph theory, and $LCA(c_p, c_r)$ is defined as the least common ancestor of $c_p$ and $c_r$. The hierarchical precision, recall, and F1-measure are then computed on label sets augmented considering the least common ancestor.
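The tree induced error can be computed from the same ancestor lists, as in the hedged sketch below; the least common ancestor found along the way is also the quantity used by the $F_{LCA}$ metric. The ancestors helper is the hypothetical one sketched in Section 2.2.

```python
# Sketch of the tree induced error TIE(c_p, c_r): the number of edges on the
# path between the predicted class c_p and the real class c_r, i.e. the distance
# from each side up to their least common ancestor.
def tie(c_p, c_r, ancestors):
    anc_p = [c_p] + ancestors(c_p)    # path from c_p up to the root
    anc_r = [c_r] + ancestors(c_r)
    lca = next(a for a in anc_p if a in anc_r)       # least common ancestor
    return anc_p.index(lca) + anc_r.index(lca)
```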
3.4 Parameter settings

Some experimental parameters in the compared methods have to be set in advance. Parameter k of HRelief is varied over the set {1, 2, 3, 4, 5, 6, 7}, and the best result is reported for each classification task. The parameters shared by HFSCN and the compared methods except HRelief are set to the same values: (1) the sparsity regularization parameter λ shared by HFSCN, HFSNM, HFSRR-PS, HFSRR-P, and HFS-O; (2) the paternity regularization coefficient α shared by HFSCN, HFSRR-PS, and HFSRR-P. For HFSCN, the three parameters are determined by a grid search, where the outlier percentage ε is selected from a set including {3%, 4%, 5%, 6%, 8%, 10%, 12%, 14%, 16%, 18%, 20%}. Different parameter values with the best results are determined for different datasets. A top-down C-SVC classifier modified from the classical implementation is used for testing the selected features.

In addition, all experiments are conducted using the 10-fold cross-validation strategy, and the average result is reported for each method.
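The grid search over the three parameters can be organized as in the following sketch; the λ and α grids and the scoring stub are placeholders (only the listed ε values come from the text, and the real scorer would be the 10-fold cross-validated hierarchical classification result).

```python
from itertools import product

# Hedged sketch of the parameter grid search for HFSCN.
def cv_score(lam, alpha, eps):
    return 0.0   # placeholder for the 10-fold cross-validation result

lam_grid = alpha_grid = [1e-3, 1e-2, 1e-1, 1.0, 10.0]     # hypothetical grids
eps_grid = [0.03, 0.04, 0.05, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.20]
best = max(product(lam_grid, alpha_grid, eps_grid), key=lambda p: cv_score(*p))
```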
4 Experimental results and analysis
In this section, we present the experimental results and discuss the performance of the proposed method from the following five aspects: (1) the classification results with different numbers of features selected by HFSCN; (2) the effects of the outlier filtering mechanism and the inter-level constraint in HFSCN; (3) two running time comparisons for feature selecting and feature testing; (4) performance comparisons using the flat and hierarchical evaluation metrics, where the differences among HFSCN and the compared methods are evaluated statistically; and (5) the convergence analysis of HFSCN.

In all the tables presenting the experimental results, the best result is marked in bold and the next best result is underlined. The mark "↑" represents that the bigger the value, the better the performance; "↓" indicates that the smaller the value, the better the performance.
4.1 Classification results with different numbers of selected features
An experiment on the four initial real datasets is conducted to verify the changes of each evaluation metric when classifying samples with different numbers of features selected by HFSCN. The experimental results show that the changes of the results on each metric are consistent, so only the $F_H$ results of the proposed method and the compared methods on the four initial datasets are presented in Fig. 7. From these results, we can obtain the following two observations:

Table 2
Detailed information about 16 experimental datasets.

No.  Dataset     Sample group         Feature description of sample outliers
(a) Initial and corrupted datasets for DD
1    DD          DD                   (no sample outliers)
2    DD-R        DD + Random          Random noise in [0, 10]
3    DD-N        DD + Gaussian        Random noise obeys Gaussian distribution N(0, 1)
4    DD-U        DD + Uniform         Random noise obeys uniform distribution U(0, 1)
(b) Initial and corrupted datasets for F194
5    F194        F194                 (no sample outliers)
6    F194-R      F194 + Random        Random noise in [0, 10]
7    F194-N      F194 + Gaussian      Random noise obeys Gaussian distribution N(0, 1)
8    F194-U      F194 + Uniform       Random noise obeys uniform distribution U(0, 1)
(c) Initial and corrupted datasets for VOC
9    VOC         VOC                  (no sample outliers)
10   VOC-R       VOC + Random         Random noise in [0, 10]
11   VOC-N       VOC + Gaussian       Random noise obeys Gaussian distribution N(0, 1)
12   VOC-U       VOC + Uniform        Random noise obeys uniform distribution U(0, 1)
(d) Initial and corrupted datasets for Car196
13   Car196      Car196               (no sample outliers)
14   Car196-R    Car196 + Random      Random noise in [0, 10]
15   Car196-N    Car196 + Gaussian    Random noise obeys Gaussian distribution N(0, 1)
16   Car196-U    Car196 + Uniform     Random noise obeys uniform distribution U(0, 1)

Table 3
Differences among HFSRR-PS, HFSRR-P, HFS-O, and HFSCN.
(1) On the two protein datasets DD and F194, the classification results keep increasing as more and more features are selected. In addition, the results no longer change or change little when HFSCN selects more than 47 features (approximately 10%).
(2) On the two image datasets VOC and Car196, the classification results are relatively good and stable when more than 20% of the features are selected by the proposed method.

Based on these results, we select 10% of the features for the protein datasets and 20% of the features for the image datasets.
4.2 Effects of the outlier filtering mechanism and inter-level constraint
The first phase of the experiment explores the effects of HFSCN on controlling the inter-level error propagation by removing the data outliers. During the feature selection process, the values of the outlier percentage ε are varied while the other parameters are fixed (the exact outlier ratio of 10% in these synthetic datasets is shown as red dotted lines in the corresponding figures). We can obtain the following observations:

(1) When the percentage of data outliers is set to or almost close to 10% (the exact outlier ratio in the synthetic datasets), HFSCN obtains the best classification performance. The larger the percentage beyond this value, the worse the performance of HFSCN. This result is likely because some samples that are important and discriminative to the classes are removed.
(2) The proposed algorithm's performance is poor when the percentage is set much smaller than the exact outlier ratio, because the remaining data outliers can lower the learning ability of algorithms.

The second stage of the experiment explores the effects of the proposed algorithm on avoiding inter-level error propagation. The values of the paternity coefficient α are varied while the other parameters are fixed for different classification datasets, so the parent–child relationship constraint is emphasized in varying degrees. The results lead to the following conclusions:

(1) An appropriate emphasis on the parent–child relationship improves the classification results. Therefore, an appropriate paternity constraint can help HFSCN to achieve more effective feature subsets and avoid inter-level error propagation.
(2) An overemphasis on the shared common features of the parent class and child class will neglect the uniqueness between categories. This conclusion is not very apparent for the VOC-R dataset, which could be attributed to the specific parameter value on this dataset.
4.3 Running time comparisons for selecting and testing features
We discuss the performance of HFSCN and the compared methods on two metrics: the running time for selecting features and the running time for testing the discriminative ability of the selected features in classification processes.
Fig. 7. $F_H$ results of HFSCN with different numbers of selected features: (a) DD; (b) F194; (c) VOC; (d) Car196.