An Ensemble Feature Selection Algorithm for Machine Learning based Intrusion Detection System
Phuoc-Cuong Nguyen∗, Quoc-Trung Nguyen†, Kim-Hung Le‡ University of Information Technology, Vietnam National University Ho Chi Minh City
Ho Chi Minh, Vietnam
Email: ∗18520545@gm.uit.edu.vn, †18521553@gm.uit.edu.vn, ‡hunglk@uit.edu.vn
Abstract—In recent years, we have witnessed the significant growth of the Internet along with emerging security threats. A machine learning-based Intrusion Detection System (IDS) is widely employed to detect cyber attacks by continuously monitoring network traffic. However, the diversity of network features considerably affects the accuracy and training time of the IDS model. In this paper, a lightweight and effective feature selection algorithm for IDS is proposed. This algorithm combines the advantages of both the Random Forest and AdaBoost algorithms. The evaluation results on popular datasets (NSL-KDD, UNSW-NB15, and CICIDS-2017) show that our proposal outperforms existing feature selection algorithms regarding detection accuracy and the number of selected features.
Index Terms—Intrusion Detection System, feature selection algorithm, machine learning, Random Forest, AdaBoost
I. INTRODUCTION
The Internet has grown exponentially and transformed every aspect of our daily lives. According to IWS (Internet World Stats), from June 2017 to March 2021, Internet users increased from 3,885 million to 5,168 million. This is a significant number because it shows that 65.6% of the world's population uses the Internet [1]. For most people, these numbers are nothing special, but for cybercriminals they are a gold mine to exploit. A specific example is the COVID-19 pandemic: when it broke out, the number of Internet users skyrocketed as many jobs moved online, and attackers took advantage of this to exploit and attack Internet users [2]. According to McAfee's statistics on the global damage caused by cyberattacks, in 2018 the damage caused by cybercrime was $500 billion [3]. By 2020, this number had nearly doubled, to about $945 billion. Therefore, the demand for IDS systems is increasing day by day.
An IDS is a device or software running on Internet gateways to detect malicious activity or policy violations. Cyber-attacks usually target our digital properties: when an attack occurs, the attackers try to obtain confidential data, modify it, or make the system stop providing service. Therefore, an IDS plays an essential role in detecting and preventing large numbers of security threats and maintaining confidentiality, integrity, and availability [4]. However, most existing IDSs cannot handle complex and continuously varying attacks. Classical IDSs (e.g., firewalls, access control mechanisms) still have limitations in comprehensively protecting the network and system from complex attacks such as DDoS [5]. To solve this problem, applying machine learning is a promising solution because it can increase detection strength, reduce the false alarm rate, and better adapt to ensemble attacks.
An IDS has to deal with huge volumes of data, including false positives and incompatible or redundant features. These features not only slow down the detection process but also consume a significant amount of resources. Thereby, a new development direction using feature selection could increase accuracy and training and testing speed [6]. It also helps to solve some common problems in IDS by detecting confounding features, resulting in lower operating costs and storage space.
The filter method is a popular approach to selecting features for IDS. It selects the best sub-feature set using a technique called variable ranking: all features are scored by a suitable ranking criterion, and any feature whose score falls below a threshold is removed [7]. However, the performance of the filter method highly depends on the threshold, which is data-specific; choosing the correct threshold is therefore very challenging. To solve this issue, in this paper we introduce a novel feature selection algorithm that combines two ensemble algorithms: Random Forest and AdaBoost. At first, we select a sub-feature set S0, then apply the ensemble algorithms and use our criterion to evaluate it. We repeat this process until most cases are covered. Then we use a clustering algorithm to select the sub-feature sets that tend to provide the best performance.
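The filter approach described above can be sketched as follows. This is a minimal illustration, not the method of this paper: the scoring criterion (mutual information), the synthetic dataset, and the threshold value are all assumptions chosen for the example.

```python
# Filter-method sketch: rank features by a scoring criterion and keep
# only those at or above a threshold. Threshold choice is data-specific,
# which is exactly the difficulty the paper points out.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

scores = mutual_info_classif(X, y, random_state=0)  # one score per feature
threshold = 0.05  # illustrative; a poor threshold keeps noise or drops signal
selected = np.where(scores >= threshold)[0]
print("kept features:", selected)
```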
The rest of this paper is organized as follows. Related works are presented in Section II. We briefly introduce our proposal in Section III. Section IV describes the evaluated datasets and experiment results. In Section V, we conclude our work.
II. RELATED WORKS
Paulo M. Mafra et al. proposed an IDS based on Genetic and SVM algorithms that can improve SVM parameters [8]. In other words, the model uses a Genetic algorithm as the optimizer to maximize the performance of the SVM. The average detection accuracy rate recorded for this model is 80.14%. Similarly, the authors in [9] proposed a similar IDS model that uses a Genetic algorithm both to optimize the parameters of the SVM and as a feature selector; the fitness function developed for the Genetic algorithm in that paper evaluates chromosomes by maximum accuracy and minimum number of features. The authors of [10] proposed a wrapper-based feature selection method as a multi-objective optimization algorithm and used an unsupervised clustering method based on Growing Hierarchical Self-Organizing Maps (GHSOMs). SOM is known as one of the most widely used unsupervised artificial neural network models. The study selects 25 features and shows an accuracy rate of 99.12 ± 0.61% and an FP rate of 2.24 ± 0.41%. These papers show an improvement over IDS models with filter-based feature selection and over IDS without feature selection.
The authors in [11] developed an IDS that uses an ensemble classifier for feature selection. This framework combines the bat algorithm (BA) and correlation-based feature selection (CFS). In addition, the ensemble classifier is built using Random Forest and Forest by Penalizing Attributes. The tests use the CIC-IDS2017 dataset and various performance metrics, including false-positive rate, true-positive rate, detection rate, accuracy, false alarm rate, and the Matthews correlation coefficient. The results indicate that the combined CFS-BA approach yields a high accuracy of 96.76%, a detection rate of 94.04%, and a low false alarm rate of 2.38%.
Rikhtegar et al. developed a feature selection model based on an associative learning mechanism [12], combining IDS feature selection with a clustering mechanism. They employed the support vector machine (SVM) and the K-Medoids clustering algorithm in turn. In addition, the approach uses a Naive Bayes classifier and the KDD CUP99 dataset for evaluation. The proposed model is evaluated using three essential performance metrics: accuracy, detection rate, and false alarm rate, which are computed from TNs (true negatives), FPs (false positives), and FNs (false negatives). The test results are compared with three other feature selection methods and show that the proposed hybrid learning method produces better accuracy (91.5%), detection rate (90.1%), and false alarm rate (6.36%).
Shadi Aljawarneh et al. developed a hybrid model used to estimate the intrusion scope threshold degree based on the optimal features of the network transaction data made available for training [7]. The experiments were performed on the NSL-KDD dataset with significant results: the accuracy of the proposed model was 99.81% for the binary class and 98.56% for the multiclass case. However, this method has some issues with high false-positive and false-negative rates, which are addressed using a hybrid approach: a Vote algorithm with Information Gain and a hybrid classifier consisting of J48, Meta Pagging, RandomTree, REPTree, AdaBoostM1, DecisionStump, and Naive Bayes. The final result is excellent, with better accuracy, a low false-negative rate, and a low false-positive rate.
The authors of [13] proposed an IDS based on feature selection and a clustering algorithm using filter and wrapper methods, named the feature grouping based on linear correlation coefficient (FGLCC) algorithm and the cuttlefish algorithm (CFA), respectively. A decision tree is used as the classifier in the proposed method. For performance verification, the method was applied to the large KDD Cup 99 dataset. The results are impressive, with a high accuracy of 95.03%, a detection rate of 95.23%, and a low false-positive rate of 1.65%.
III. ENSEMBLE FEATURE SELECTION
The primary idea of our proposal is to combine the advantages of both Random Forest and AdaBoost to maximize feature selection quality. For a large dataset with many noisy samples, AdaBoost can be selected; otherwise, Random Forest is more suitable for a smaller, less noisy dataset. Compared with a deep-learning approach, our proposal achieves similar accuracy but is significantly faster. The algorithm is described in more detail below:
Proposed model's algorithm
Input: Dataset D
Output: Selected feature set Sbest
initialize: rounds R, Gbest[R], C[accuracy, running time]
for n = 1 to R:
    randomize F (F0, F1, ..., Fn-1)
    RF = RandomForest(D, F)
    AB = AdaBoost(D, F)
    Gn = compare(RF, AB)
    Gbest.append(Gn)
C = cluster(Gbest)
Sbest = MostFrequent(C[high accuracy, low running time])
return Sbest
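The loop above can be sketched in Python as follows. This is a simplified interpretation, not the authors' implementation: the number of rounds, subset size, scoring by cross-validation, and the final selection rule (best accuracy, ties broken by lower runtime, standing in for the clustering step) are all assumptions.

```python
# Sketch of the proposed loop: draw a random feature subset each round,
# score it with both Random Forest and AdaBoost, record the results, and
# finally pick the subset that performed best.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=20,
                           n_informative=6, random_state=0)
rng = np.random.default_rng(0)

candidates = []
for _ in range(5):  # rounds R (illustrative)
    subset = rng.choice(X.shape[1], size=8, replace=False)  # randomize F
    for clf in (RandomForestClassifier(n_estimators=50, random_state=0),
                AdaBoostClassifier(n_estimators=50, random_state=0)):
        start = time.perf_counter()
        acc = cross_val_score(clf, X[:, subset], y, cv=3).mean()
        candidates.append((acc, time.perf_counter() - start, subset))

# Stand-in for the clustering step: highest accuracy, then lowest runtime.
best_acc, best_time, s_best = max(candidates, key=lambda c: (c[0], -c[1]))
print("Sbest:", sorted(s_best))
```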
Random Forest is a tree-based algorithm that combines many decision trees into an ensemble. The primary purpose is to achieve much better accuracy than a single decision tree alone. Each tree is built from different sub-attributes and different sub-datasets; in this way, the trees protect each other from their individual errors. In detail, the algorithm can be described as follows:
• Step 1: Randomly select n features from the original feature set.
• Step 2: Build a decision tree based on the n selected features, then go back to Step 1.
• Step 3: Repeat until a large number of decision trees have been built.
• Step 4: Randomly select m decision trees from the tree system. For each test sample, each decision tree predicts which class the sample belongs to; the final result is the class that receives the most votes.
Random Forest is suitable when the dataset has many missing values. The downside of this algorithm is performance, since it deals with a large number of decision trees. The workflow is illustrated below:
[Figure: Random Forest workflow — training samples 1..n are drawn from the training set, each trains a decision tree (1..n), and the trees' predictions on the test set are combined by voting into the final prediction.]
AdaBoost is an ensemble algorithm designed to improve accuracy by adjusting the weights associated with each weak learner (such as a small decision tree) on the same dataset. The prediction uses a weighted majority vote over all weak learners in the system. The overall algorithm can be described as follows:
• Step 1: Select and build the weak learners (small decision trees); initialize the weight of each training sample to 1/N for N samples.
• Step 2: Select a sub training set and start training.
• Step 3: Decrease the weights of the samples that were predicted correctly in the last round.
• Step 4: Increase the weights of the samples that were predicted incorrectly.
• Step 5: Go back to Step 2 with these new weights.
By repeating steps 3 and 4, AdaBoost forces the weak learners to concentrate on the samples that were predicted incorrectly, improving overall accuracy.
AdaBoost takes less time to run. The trade-off is that its accuracy is not as good as Random Forest's, but we can use it in some cases to maximize the performance of our algorithm.
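The reweighting procedure above is what scikit-learn's AdaBoost implements. A minimal sketch on a synthetic dataset follows; the default weak learner is a depth-1 decision tree (a stump), matching the "small decision tree" described above, and the hyper-parameters are illustrative.

```python
# AdaBoost sketch: a sequence of weak learners is fit, each round
# reweighting the training samples so later learners focus on the
# examples earlier ones got wrong; prediction is a weighted vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=600, n_features=15, random_state=2)

boost = AdaBoostClassifier(
    n_estimators=100,  # number of weak learners (depth-1 trees by default)
    random_state=2,
).fit(X, y)

print(f"training accuracy: {boost.score(X, y):.3f}")
```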
More details about AdaBoost are given in the pseudo-code below:

The boosting algorithm AdaBoost
Given: (x1, y1), ..., (xm, ym) where xi ∈ X, yi ∈ {−1, +1}.
Initialize: D1(i) = 1/m for i = 1, ..., m.
For t = 1, ..., T:
    - Train a weak learner using distribution Dt.
    - Get a weak hypothesis ht: X → {−1, +1}.
    - Aim: select ht with low weighted error: εt = Pr_{i~Dt}[ht(xi) ≠ yi].
    - Choose αt = (1/2) ln((1 − εt) / εt).
    - Update, for i = 1, ..., m:
        D_{t+1}(i) = Dt(i) · exp(−αt · yi · ht(xi)) / Zt,
      where Zt is a normalization factor (chosen so that D_{t+1} is a distribution).
Output the final hypothesis: H(x) = sign(Σ_{t=1}^{T} αt ht(x)).

The process can be explained as follows. First, we are given a training set containing m samples, where each input xi is an element of the set X and each output yi belongs to a set of only two values: −1 (the negative class) and +1 (the positive class). Next, the weight of every sample is initialized to 1 divided by the number of training samples. Then, for each classifier t = 1 to T, we fit it to the training data (each prediction is −1 or +1) and choose the classifier with the lowest number of misclassifications.

IV. RESULTS AND DISCUSSION

A. Dataset

1) DARPA KDDCUP99 dataset: The DARPA dataset was originally developed in 1998, aiming to improve research and surveys in intrusion detection. KDDCUP99 is an upgraded version of the DARPA 1999 dataset, used to develop intrusion detection systems that distinguish between bad and good connections. The dataset is mainly designed for detecting intrusions in a network through simulation in a military environment. KDDCUP99 was designed and developed using the DARPA98 IDS and is used to simulate four different types of attacks. These attacks can be classified into four main categories:
• Denial of Service attacks (DoS): The attacker tries to limit network usage by disrupting service availability to intended users.
• User to Root attacks (U2R): This type of attack occurs when an attacker gains access through a normal user account and tries to gain root access through system vulnerabilities.
• Remote to Local attacks (R2L): The attacker does not have an account on the local system but tries to gain access by sending network packets that exploit vulnerabilities, gaining access as a local user.
• Probing attacks: This happens when an attacker scans a network to gather information about the system in order to use it to evade system security controls.
TABLE I
KDDCUP99 TRAINING DATASET DISTRIBUTION

Class    # of instances   Percentage (%)
Normal   97,277           19.69
Total    494,019          100
2) NSL-KDD dataset: The NSL-KDD dataset is an improved version of KDDCUP99, suggested to solve some of the problems of KDDCUP99. NSL-KDD has a reasonable number of records in its training and testing sets and has the same features as the original KDDCUP99. It is an effective standard by which researchers compare their proposed IDSs. Some of the improvements of this dataset are:
• There are no redundant records in the training set, so the classifier will not produce any misleading results.
• There are no duplicate records in the testing set, which results in a better reduction ratio.
• The number of selected records from each group of difficulty levels is inversely proportional to the percentage of records in the original KDD dataset.
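The no-duplicate property above can be checked with a simple deduplication pass. In this sketch a tiny synthetic frame stands in for the real NSL-KDD file; the column names are invented for illustration.

```python
# Sketch of the duplicate-record check behind NSL-KDD's cleanup of
# KDDCUP99; the columns and values here are placeholders, not real data.
import pandas as pd

df = pd.DataFrame({
    "duration": [0, 0, 5, 0],
    "src_bytes": [181, 181, 239, 181],
    "label": ["normal", "normal", "neptune", "normal"],
})
deduped = df.drop_duplicates()
print(f"{len(df) - len(deduped)} duplicate rows removed")
```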
3) UNSW-NB15 dataset: The raw network packets of the UNSW-NB15 dataset were created with the IXIA PerfectStorm tool in the Cyber Range Lab of UNSW Canberra to generate a hybrid of real modern normal activities and synthetic contemporary attack behaviors. The attack activities in the network traffic include Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. There are 49 features, developed using the Argus and Bro-IDS tools together with twelve algorithms that cover characteristics of network packets. In contrast, existing benchmark datasets such as KDD98, KDDCUP99, and NSL-KDD contain a limited number of attacks and information about outdated packets. The tcpdump tool was used to capture network traffic in the form of packets. To capture 100 GB, the simulation period was 16 hours on Jan 22, 2015 and 15 hours on Feb 17, 2015. Each pcap file was divided into 1000 MB chunks using the tcpdump tool. To create reliable features from the pcap files, the Argus and Bro-IDS tools were used, and twelve algorithms were developed in C# to analyze the flows of the connection packets in depth.
B. Results
In this section, we show the results of using our proposed model on three datasets: NSL-KDD, UNSW-NB15, and CICIDS-2017.
TABLE II
THE COMPARISON RESULTS OF THE TEST MODEL RUNNING WITH ALL

# of features   Loss   Accuracy   Time/step   Time/epoch

TABLE III
THE COMPARISON RESULTS OF THE TEST MODEL RUNNING WITH ALL

# of features   Loss   Accuracy   Time/step   Time/epoch
As shown in Tables II, III, and IV, our proposal improves the model performance (accuracy, loss, and training time) on all datasets. For example, with the CICIDS2017 dataset, the accuracy is stable, but the training time of the algorithm is significantly reduced from 1600 seconds to 535 seconds. In contrast, with the UNSW-NB15 dataset, although the training time did not decrease, the algorithm's accuracy improved from 0.6924 to 0.7774. In the case of the NSL-KDD dataset, the result shows an improvement in both accuracy and training time.

TABLE IV
THE COMPARISON RESULTS OF THE TEST MODEL RUNNING WITH ALL

# of features   Loss   Accuracy   Time/step   Time/epoch
To demonstrate the effectiveness of our proposal over state-of-the-art methods, we compare it with Sigmoid PIO and Cosine PIO [14], the newest feature selection methods. As shown in Tables V and VI, our method is more effective than PIO on the KDDCUP99 dataset: only six features are selected, yet the accuracy is still guaranteed. Similarly, on the NSL-KDD dataset, the number of selected features is more than for the Cosine PIO method but fewer than for the Sigmoid PIO, while the accuracy is much higher than for both methods. These comparison results again demonstrate the effectiveness of our proposal.
V. CONCLUSION
In this paper, we introduced an ensemble feature selection algorithm for IDS. This algorithm aims at boosting the detection accuracy of IDS while reducing the number of network features required to build the IDS model. Evaluation on the NSL-KDD, UNSW-NB15, and CICIDS-2017 datasets shows that our proposal could reduce the number of features from 83 to 20, 43 to 14, and 40 to 14, respectively. As a result, the model training time is significantly reduced. Furthermore, the proposed algorithm is more effective than related works, as demonstrated by achieving better accuracy with fewer selected features.
REFERENCES
[1] R. Morrar, H. Arman, and S. Mousa, "The fourth industrial revolution (industry 4.0): A social innovation perspective," Technology Innovation Management Review, vol. 7, no. 11, pp. 12–20, 2017.
[2] R. P. Singh, M. Javaid, A. Haleem, and R. Suman, "Internet of things (IoT) applications to fight against COVID-19 pandemic," Diabetes & Metabolic Syndrome: Clinical Research & Reviews, vol. 14, no. 4, pp. 521–524, 2020.
[3] H. S. Brar and G. Kumar, "Cybercrimes: A proposed taxonomy and challenges," Journal of Computer Networks and Communications, vol. 2018, 2018.
[4] K. Khan, A. Mehmood, S. Khan, M. A. Khan, Z. Iqbal, and W. K. Mashwani, "A survey on intrusion detection and prevention in wireless ad-hoc networks," Journal of Systems Architecture, vol. 105, p. 101701, 2020.
[5] A. Aldweesh, A. Derhab, and A. Z. Emam, "Deep learning approaches for anomaly-based intrusion detection systems: A survey, taxonomy, and open issues," Knowledge-Based Systems, vol. 189, p. 105124, 2020.
[6] S. Maza and M. Touahria, "Feature selection algorithms in intrusion detection system: A survey," KSII Transactions on Internet and Information Systems (TIIS), vol. 12, no. 10, pp. 5079–5099, 2018.
TABLE V
THE COMPARISON RESULTS WITH THE DECISION TREES MODEL ON THE KDDCUP99 DATASET

Technique     # of features   Accuracy   Selected features
Sigmoid PIO   10              0.869      [3, 4, 6, 11, 13, 18, 23, 36, 37, 39]
Cosine PIO    7               0.883      [3, 4, 6, 13, 23, 29, 34]
TABLE VI
THE COMPARISON RESULTS WITH THE DECISION TREES MODEL ON THE NSL-KDD DATASET

Technique     # of features   Accuracy   Selected features
Sigmoid PIO   18              0.869      [1, 3, 4, 5, 6, 8, 10, 11, 12, 13, 14, 15, 17, 18, 27, 32, 36, 39, 41]
Cosine PIO    5               0.883      [2, 6, 10, 22, 27]
Our method    14              0.980      [2, 3, 6, 8, 10, 11, 13, 23, 27, 30, 32, 35, 36, 39]
[7] S. Aljawarneh, M. Aldwairi, and M. B. Yassein, "Anomaly-based intrusion detection system through feature selection analysis and building hybrid efficient model," Journal of Computational Science, vol. 25, pp. 152–160, 2018.
[8] P. M. Mafra, V. Moll, J. da Silva Fraga, and A. O. Santin, "Octopus-IIDS: An anomaly based intelligent intrusion detection system," in The IEEE Symposium on Computers and Communications. IEEE, 2010, pp. 405–410.
[9] L. Zhuo, J. Zheng, X. Li, F. Wang, B. Ai, and J. Qian, "A genetic algorithm based wrapper feature selection method for classification of hyperspectral images using support vector machine," in Geoinformatics 2008 and Joint Conference on GIS and Built Environment: Classification of Remote Sensing Images, vol. 7147. International Society for Optics and Photonics, 2008, p. 71471J.
[10] E. De la Hoz, E. De La Hoz, A. Ortiz, J. Ortega, and A. Martínez-Álvarez, "Feature selection by multi-objective optimisation: Application to network anomaly detection by hierarchical self-organising maps," Knowledge-Based Systems, vol. 71, pp. 322–338, 2014.
[11] Y. Zhou, G. Cheng, S. Jiang, and M. Dai, "Building an efficient intrusion detection system based on feature selection and ensemble classifier," Computer Networks, vol. 174, p. 107247, 2020.
[12] L. Khalvati, M. Keshtgary, and N. Rikhtegar, "Intrusion detection based on a novel hybrid learning approach," Journal of AI and Data Mining, vol. 6, no. 1, pp. 157–162, 2018.
[13] S. Mohammadi, H. Mirvaziri, M. Ghazizadeh-Ahsaee, and H. Karimipour, "Cyber intrusion detection by combined feature selection algorithm," Journal of Information Security and Applications, vol. 44, pp. 80–88, 2019.
[14] H. Alazzam, A. Sharieh, and K. E. Sabri, "A feature selection algorithm for intrusion detection system based on pigeon inspired optimizer," Expert Systems with Applications, vol. 148, p. 113249, 2020.