This paper is available online at http://stdb.hnue.edu.vn
BUILDING MODELS FOR DETECTING SYSTEM ATTACKS
BASED ON DATA MINING
Pham Duy Trung1, Luong The Dung1 and Nguyen Duy Hai2
1Academy of Cryptography Techniques,
2Centre of Information Technology, Hanoi National University of Education
Abstract. With the development of the Internet, network security has become an indispensable factor of computer technology. Intrusion Detection Systems (IDS) play an important role in network security. One aspect that affects the accuracy and performance of an IDS is the classifier it uses. This paper proposes a new approach which combines different classifiers in order to make the best use of each classifier. To build the new model, we evaluate the accuracy and performance (training and testing time) of three classification algorithms: ID3, Naïve Bayes and SVM. Our experimental results using the KDDCup’99 IDS dataset, based on the 10-fold cross validation test, show that against any one particular type of attack, one of the classifiers functions best. The purpose of this study is to enhance the accuracy and performance of IDS against particular types of attacks.
Keywords: Network security, data mining, network computer.
1 Introduction
The Internet pervades almost every aspect of life and business and, due to the exponential growth of this trend, there is a critical need to secure these systems from unauthorized disclosure, transfer, modification or destruction. An Intrusion Detection System (IDS) inspects the activities in a system for suspicious behavior or patterns that may indicate an ongoing system attack or misuse. Recently, as networks have become faster, a need has emerged for security analysis techniques that can keep up with the increased network throughput [1]. Due to the large volumes of security audit data as well as the complex and dynamic properties of intrusion behaviors, optimizing the performance of IDS has become an important, open problem that is receiving more attention from the research community [2].
Received May 25, 2013. Accepted June 30, 2013.
Contact Nguyen Duy Hai, e-mail address: haind@hnue.edu.vn
Besides expert systems, state transition analysis and statistical analysis, data mining has become a popular technique for detecting intrusion [3]. The main reason for using data mining techniques for IDS is that they are capable of handling the enormous volume of existing and newly appearing network data that require processing. One of the most important data mining techniques for intrusion detection is classification. Classification models can be built using a wide variety of algorithms, which can be grouped into three types: extensions to linear discrimination (e.g., multilayer perceptron and logistic discrimination), decision tree and rule-based methods (e.g., C4.5 or J48, AQ and CART) and density estimators (Naïve Bayes, k-nearest neighbor and LVQ) [4]. A search of the literature shows that a 3-level classification model with the C4.5 algorithm provides a DoS detection rate of almost 100% [5]. Rung-Ching Chen et al. [6] proposed an intrusion detection method using SVM based on rough set theory (RST). They show that an accuracy of 86.79% could be achieved using 41 features, while using a rough set increased the accuracy to 89.13%.
No single data mining algorithm for intrusion detection has been identified as the best. Furthermore, it should be noted that once IDS are more widely used, new properties will have to be taken into consideration, such as large volumes of security audit data and the complex and dynamic properties of intrusion behavior. One difficulty encountered in such a study concerns the lack of published objective comparisons between classifiers. Ideally, classifiers should be tested within the same context, i.e., with the same dataset and using the same feature extraction method. Currently, this is a crucial problem for IDS research based on data mining.
In this paper, we evaluate three data mining algorithms for intrusion detection, Naïve Bayes, J48 and Support Vector Machine (SVM), based on a data mining structure for IDS. In addition, we propose a new approach which combines different classifiers in order to make the best use of each classifier. The purpose of our research is to enhance the accuracy and performance of IDS against particular types of attacks.
2.1 The data mining model for IDS
In recent years, there has been an increase in the use of data mining-based approaches to build intrusion detection models. Our intrusion detection models can be built in five steps. The process starts with an initial set of network audit data. The data are preprocessed, and then the optimal set of features is obtained by the feature extraction and feature selection stages before classification.
Systems that construct classifiers are commonly used tools in data mining. Such systems take as input a collection of cases, each belonging to one of a small number of classes and described by a fixed set of attributes, and output a classifier that can accurately predict the class to which a new case belongs.
Network Audit
↓
Data Preprocess
↓
Feature Extraction
↓
Feature Selection
↓
Classification
Figure 1. Intrusion detection model based on data mining
2.2 Experiment
2.2.1 Dataset
The KDD Cup 1999 dataset [7] was derived from the 1998 DARPA Intrusion Detection Evaluation program prepared and managed by the MIT Lincoln Laboratory. The dataset is a collection of simulated raw TCP dump data collected over a period of nine weeks. The simulated attacks were classified according to the actions and goals of the attacker. The dataset consists of one type of normal data and 22 different attack types categorized into 4 classes: Denial of Service (DoS), Probe, User-to-Root (U2R) and Remote-to-Local (R2L).
Denial of Service (DoS) attacks have the goal of limiting or denying services provided to the user, computer or network. A common tactic is to severely overload the targeted system. Probing or surveillance attacks have the goal of gaining knowledge of the existence or configuration of a computer system or network. Port scans or sweeps of a given IP address range typically fall into this category.
User-to-Root (U2R) attacks have the goal of gaining root or super-user access on a particular computer or system on which the attacker previously had user-level access. These are attempts by a non-privileged user to gain administrative privileges. A Remote-to-Local (R2L) attack is an attack in which a user sends packets to a machine to which the user does not have access, in order to expose the machine’s vulnerabilities and exploit privileges which a local user would have on the computer.
The details of the attacks in the labeled records are given in Table 1.
Table 1. Attack classification
Class   Attacks
DoS     Neptune, Smurf, Pod, Teardrop, Land, back, mailbomb, processtable, udpstorm
Probe   portsweep, IPsweep, nmap, mscan
U2R     buffer_overflow, loadmodule, perl, rootkit, httptunnel, ps, sqlattack, xterm
R2L     guess_password, ftp_write, Imap, multihop, named, phf, sendmail, snmpgetattack,
        snmpguess, spy, warezclient, warezmaster, worm, xlock, xsnoop
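The grouping in Table 1 amounts to a lookup from an attack name to one of the four attack classes. The following is a minimal sketch of such a lookup, not the authors' code. The names `ATTACK_CLASSES`, `LABEL_TO_CLASS` and `to_class` are hypothetical, attack names are lower-cased for matching, the spellings `httptunnel` and `ftp_write` are used, and stripping a trailing "." from raw labels is an assumption about how the labels are stored.

```python
# Attack names grouped by class, following Table 1.
# Assumption of this sketch: names are lower-cased for matching.
ATTACK_CLASSES = {
    "DoS": ["neptune", "smurf", "pod", "teardrop", "land", "back",
            "mailbomb", "processtable", "udpstorm"],
    "Probe": ["portsweep", "ipsweep", "nmap", "mscan"],
    "U2R": ["buffer_overflow", "loadmodule", "perl", "rootkit",
            "httptunnel", "ps", "sqlattack", "xterm"],
    "R2L": ["guess_password", "ftp_write", "imap", "multihop", "named",
            "phf", "sendmail", "snmpgetattack", "snmpguess", "spy",
            "warezclient", "warezmaster", "worm", "xlock", "xsnoop"],
}

# Invert the table: attack name -> attack class.
LABEL_TO_CLASS = {
    name: cls for cls, names in ATTACK_CLASSES.items() for name in names
}


def to_class(label: str) -> str:
    """Map a raw record label to 'Normal', an attack class, or 'Unknown'."""
    # Assumption: raw labels may carry surrounding spaces or a trailing dot.
    key = label.strip().rstrip(".").lower()
    if key == "normal":
        return "Normal"
    return LABEL_TO_CLASS.get(key, "Unknown")
```

With such a mapping, every labeled record can be reduced to a five-way target (Normal, DoS, Probe, U2R, R2L) before a classifier is trained.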
We used the 10% subset of the overall KDD Cup 1999 labeled dataset, which contains 494,020 records with 41 features. The distribution of connection types is given in Table 2.
Table 2 Distribution of connection types in the KDD CUP’99 Training Dataset
Class Number of instances Percentage of occurrence
Because the dataset is large, duplicate instances were removed, and a random sample of 10% of the normal data and 10% of the Neptune attacks in the DoS class was selected, while all of the other data were retained.
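The reduction step described above can be sketched as follows. This is an illustrative reading of the text rather than the authors' procedure: the function name `reduce_dataset`, the `label_of` callback, the fixed random seed and the lower-case label strings "normal" and "neptune" are all assumptions, and records are assumed to be hashable (for example, tuples).

```python
import random


def reduce_dataset(records, label_of, fraction=0.10, seed=0):
    """Drop exact duplicates, then keep a random `fraction` of the
    'normal' and 'neptune' records and every other record."""
    rng = random.Random(seed)
    # dict.fromkeys removes duplicates while preserving the original order.
    unique = list(dict.fromkeys(records))

    kept = []
    sampled_groups = {"normal": [], "neptune": []}
    for rec in unique:
        label = label_of(rec)
        if label in sampled_groups:
            sampled_groups[label].append(rec)   # to be down-sampled
        else:
            kept.append(rec)                    # retained in full

    for group in sampled_groups.values():
        k = round(len(group) * fraction)
        kept.extend(rng.sample(group, k))
    return kept
```

Down-sampling only the two most frequent labels keeps the rare classes intact while shrinking the dataset.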
2.2.2 Feature selection
The feature set includes the basic features of an individual TCP connection, such as duration, protocol type, number of bytes transferred and the flag indicating the normal or error status of the connection. Other features of an individual connection were obtained using some domain knowledge, and include the number of file creation operations and the number of failed login attempts. In total, there are 41 features, most of them taking on continuous values, as shown in Table 3.
Table 3. KDD Cup’99 features
No.  Feature name          No.  Feature name
11   num_failed_logins     32   dst_host_count
13   num_compromised       34   dst_host_same_srv_rate
14   root_shell            35   dst_host_diff_srv_rate
15   su_attempted          36   dst_host_same_srv_port_rate
16   num_root              37   dst_host_srv_diff_host_rate
17   num_file_creations    38   dst_host_serror_rate
18   num_shells            39   dst_host_srv_serror_rate
19   num_access_files      40   dst_host_rerror_rate
20   num_outbound_cmd      41   dst_host_srv_rerror_rate
21   is_host_login
2.3 Results and discussions
The three techniques used to build the intrusion detection models, SVM, Naïve Bayes and J48, were obtained from WEKA [8]. The Radial Kernel and Neural Kernel were selected for the SVM technique. We chose those settings to obtain the highest performance for those techniques. In our experiments, 10-fold cross validation was used to obtain the intrusion detection rates for the three techniques.
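To make the evaluation protocol concrete, the sketch below shows how 10-fold cross validation partitions the data and averages accuracy over the folds. It is a plain, unstratified illustration and is not WEKA's implementation. The names `k_fold_indices` and `cross_val_accuracy`, and the `fit`/`predict` callables standing in for a classifier, are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_val_accuracy(fit, predict, X, y, k=10):
    """Average accuracy over k folds (assumes len(X) >= k).

    fit(X, y) returns a model; predict(model, x) returns a label.
    """
    scores = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[j] for j in train], [y[j] for j in train])
        correct = sum(predict(model, X[j]) == y[j] for j in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)
```

Running the same folds for every classifier keeps the comparison between algorithms on an equal footing, since each one is trained and tested on identical splits.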
When comparing the accuracy of the multi-class classifier and the two-class classifier used with ID3 and Naïve Bayes, it can be seen that the two-class classifier has better results based on the accuracy criterion. Figure 2 indicates that the decision tree produces better accuracy for Probe, R2L and U2R compared to SVM and Naïve Bayes. Its accuracy is lower than that of SVM but higher than that of Naïve Bayes for DoS with a small dataset. Therefore, SVM is not suitable with such a small dataset. This finding is consistent with the studies of Mohammadreza Ektefa et al. [9], which showed that C4.5 algorithms performed better than SVM in detecting network intrusions and regarding false alarms.
Figure 2. Comparing the accuracy of the three algorithms
Figure 3. Comparing the model building time of the three algorithms
In Figure 3, Naïve Bayes has the best training time, while the training time for SVM is much higher than for the others. Figure 4 shows that the test time of decision trees is much better than that of the others; thus, the use of decision tree classifier systems for intrusion detection will enhance system performance significantly.
Figure 4. Comparing the model testing time of the three algorithms
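Model building time and testing time, as compared in the figures above, can be measured separately by timing the training call and the prediction loop. The sketch below is only an illustration of that measurement and does not reproduce how WEKA reports its timings. The helper names `timed` and `measure`, and the `fit`/`predict` callables, are hypothetical.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def measure(fit, predict, X_train, y_train, X_test):
    """Return (training_time, testing_time, predictions) for one classifier."""
    # Training time: building the model from the training set.
    model, train_time = timed(fit, X_train, y_train)
    # Testing time: classifying every record of the test set.
    preds, test_time = timed(lambda: [predict(model, x) for x in X_test])
    return train_time, test_time, preds
```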
2.4 Attack classification method based on combined classifiers
From the experimental results, we can provide an integrated model to select efficient algorithms for each specific type of attack. Observing the chart and table, we can see that one classification model can give better results than the other models for a certain type of attack, so the best algorithm should be selected for each specific type of attack. Therefore, assuming that the IDS is built from several different classifiers and can run in parallel on n processors at the same time, each processor will run one classification algorithm (classifier). The attack class of each new access to the system (new record) can then be selected by a voting algorithm over the classifiers. The algorithm is presented in Figure 5.
Input:  - New record: r
        - n classification algorithms: CF1, ..., CFn
        - Processors: P0, P1, ..., Pn
Output: C (class of the new record)
Begin
  For i = 1 to n, each Pi do in parallel
  Begin
    C[i] := CFi(r); Send(C[i], P0);
  End
  If (tid == 0) then P0 do
  Begin
    // Count the votes for each distinct class
    k := 1; Class[1] := C[1]; Count[1] := 1;
    For i = 2 to n
    Begin
      j := 1;
      While (j <= k and Class[j] != C[i]) j := j + 1;
      If (j <= k) Count[j] := Count[j] + 1;
      Else Begin
        k := k + 1;
        Class[k] := C[i];
        Count[k] := 1;
      End
    End
    // Select the class with the most votes
    maxd := 0;
    For i = 1 to k
      If (maxd < Count[i]) Begin
        maxd := Count[i];
        C := Class[i];
      End
  End
  Output C;
End
Figure 5. Attack classification model based on combined classifiers
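The voting scheme of Figure 5 can be sketched in a few lines of Python. This is one possible reading of the pseudocode under stated assumptions, not the authors' implementation: a thread pool stands in for the n processors, each classifier is assumed to be a callable that takes a record and returns a class label, and ties are broken in favor of the class that appears first in classifier order. The function name `classify_by_vote` is hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def classify_by_vote(record, classifiers):
    """Run every classifier on `record` in parallel and return the majority class."""
    if not classifiers:
        raise ValueError("at least one classifier is required")

    # Each worker plays the role of one processor Pi computing C[i] := CFi(r).
    with ThreadPoolExecutor(max_workers=len(classifiers)) as pool:
        votes = list(pool.map(lambda clf: clf(record), classifiers))

    # The collecting step (P0 in Figure 5): count votes and pick the maximum.
    counts = Counter(votes)
    best = max(counts.values())
    # Ties go to the class predicted first in classifier order.
    for vote in votes:
        if counts[vote] == best:
            return vote
```

For example, with three classifiers that return "DoS", "Probe" and "DoS" for a record, the function returns "DoS".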
3 Conclusion
This paper proposed a new approach which combines different classifiers in order to make the best use of each classifier. To build the new model, we evaluated the accuracy and performance (training and testing time) of three classification algorithms: ID3, Naïve Bayes and SVM. Our experimental results using the KDDCup’99 IDS dataset, based on the 10-fold cross validation test, show that for each particular type of attack, one of the classifiers functions best.
REFERENCES
[1] Christopher Kruegel, Fredrik Valeur, Giovanni Vigna and Richard A. Kemmerer, 2002. Stateful Intrusion Detection for High-Speed Networks. In IEEE Symposium on Security and Privacy, IEEE Computer Society Press, USA.
[2] Nguyen, H. & Choi, D., 2008. Application of data mining to network intrusion detection: classifier selection model. Springer-Verlag Berlin Heidelberg, pp. 399-408.
[3] Lu, C.-T., Boedihardjo, A.P., Manalwar, P., 2005. Exploiting efficient data mining techniques to enhance intrusion detection systems. Information Reuse and Integration, IRI-2005 IEEE International Conference, pp. 512-517.
[4] Henery, R. J., 1994. Classification. Machine Learning, Neural and Statistical Classification.
[5] C. Xiang, M.Y. Chong, H.L. Zhu, 2004. Design of multiple-level tree classifiers for intrusion detection system. Cybernetics and Intelligent Systems, IEEE Conference, Vol. 2, pp. 873-878.
[6] Rung-Ching Chen, Kai-Fan Cheng, Ying-Hao Chen, Chia-Fen Hsieh, 2009. Using Rough Set and Support Vector Machine for Network Intrusion Detection System. Intelligent Information and Database Systems, ACIIDS 2009 First Asian Conference, pp. 465-470.
[7] KDD99: http://kdd.ics.uci.edu/databases/kddcup99/10 percent.gz
[8] WEKA: http://sourceforge.net/projects/weka/
[9] Mohammadreza Ektefa, Sara Memar, Fatimah Sidi, Lilly Suriani Affendey, 2010. Intrusion Detection Using Data Mining Techniques. Proc. of IEEE Intl. Conference on Information Retrieval & Knowledge Management, pp. 200-203.