Building models for detecting system attacks based on data mining

This paper is available online at http://stdb.hnue.edu.vn

Pham Duy Trung (1), Luong The Dung (1) and Nguyen Duy Hai (2)

(1) Academy of Cryptography Techniques

(2) Centre of Information Technology, Hanoi National University of Education

Abstract. With the development of the Internet, network security has become an indispensable factor of computer technology. Intrusion Detection Systems (IDS) play an important role in network security. One aspect which affects the accuracy and performance of an IDS is its classifier. This paper proposes a new approach which combines different classifiers in order to make the best use of each classifier. To build the new model, we evaluate the accuracy and performance (training and testing time) of three classification algorithms: ID3, Naïve Bayes and SVM. Our experimental results using the KDDCup’99 IDS dataset based on the 10-fold cross validation test show that, against any one particular type of attack, one of the classifiers functions best. The purpose of this study is to enhance the accuracy and performance of IDS against particular types of attacks.

Keywords: Network security, data mining, network computer

1 Introduction

The Internet pervades almost every aspect of life and business and, due to the exponential growth of this trend, there is a critical need to secure these systems from unauthorized disclosure, transfer, modification or destruction. An Intrusion Detection System (IDS) inspects the activities in a system for suspicious behavior or patterns that may indicate an ongoing system attack or misuse. Recently, as networks have become faster, a need has emerged for security analysis techniques that are able to keep up with the increased network throughput [1]. Due to the large volumes of security audit data as well as the complex and dynamic properties of intrusion behaviors, optimizing the performance of IDS has become an important, open problem that is receiving more attention from the research community [2].

Received May 25, 2013. Accepted June 30, 2013.

Contact Nguyen Duy Hai, e-mail address: haind@hnue.edu.vn

Besides expert systems, state transition analysis and statistical analysis, data mining has become a popular technique for detecting intrusion [3]. The main reason for using data mining techniques for IDS is that they are capable of handling the enormous volume of existing and newly appearing network data that require processing. One of the most important data mining techniques for intrusion detection is classification. Classification models can be built using a wide variety of algorithms, which can be grouped into three types: extensions to linear discrimination (e.g., multilayer perceptron and logistic discrimination), decision tree and rule-based methods (e.g., C4.5 or J48, AQ and CART) and density estimators (Naïve Bayes, k-nearest neighbor and LVQ) [4]. A search of the literature shows that a 3-level classification model with the C4.5 algorithm provides a DoS detection rate of almost 100% [5]. Rung-Ching Chen et al. [6] proposed an intrusion detection method using SVM based on rough set theory (RST). They show that an accuracy of 86.79% could be achieved using 41 features, while using a rough set increased the accuracy to 89.13%.

No single data mining algorithm for intrusion detection has been identified as being the best. Furthermore, it should be noted that once IDS are more widely used, new properties will have to be taken into consideration, such as large volumes of security audit data and the complex and dynamic properties of intrusion behavior. One difficulty encountered in such a study concerns the lack of published objective comparisons between classifiers. Ideally, classifiers should be tested within the same context, i.e., with the same dataset and using the same feature extraction method. Currently, this is a crucial problem for IDS research based on data mining.

In this paper, we evaluate three data mining algorithms for intrusion detection, Naïve Bayes, J48 and Support Vector Machine (SVM), based on the data mining structure for IDS. In addition, we propose a new approach which combines different classifiers in order to make the best use of each classifier. The purpose of our research is to enhance the accuracy and performance of IDS against particular types of attacks.

2.1 The data mining model for IDS

In recent years, there has been an increase in the use of data mining-based approaches to build intrusion detection models. Our intrusion detection models are built in five steps. The process starts with an initial set of network audit data. The data are preprocessed, and then the optimal set of features is obtained by the feature extraction and feature selection stages before classification.

Systems that construct classifiers are commonly used tools in data mining. Such systems take a collection of cases as input, each belonging to one of a small number of classes and described by a fixed set of attributes, and output a classifier that can accurately predict the class to which a new case belongs.

Network Audit Data
↓
Data Preprocessing
↓
Feature Extraction
↓
Feature Selection
↓
Classification

Figure 1. Intrusion detection model based on data mining
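The five stages of Figure 1 can be read as a chain of functions, each consuming the output of the previous one. The following is only a minimal sketch under stated assumptions: the function names, the comma-separated input format, the position of the label in the last column and the toy classifier are all hypothetical and are not taken from the experiments in this paper.

```python
def preprocess(raw_rows):
    """Stage 2: split comma-separated audit lines and drop empty ones."""
    return [[field.strip() for field in line.split(",")]
            for line in raw_rows if line.strip()]

def extract_features(rows, numeric_columns):
    """Stage 3: turn the chosen columns into floats (assumes the label is last)."""
    return [([float(row[c]) for c in numeric_columns], row[-1]) for row in rows]

def select_features(samples, keep):
    """Stage 4: keep only the feature positions listed in `keep`."""
    return [([vec[i] for i in keep], label) for vec, label in samples]

def classify(samples, classifier):
    """Stage 5: apply a classifier (any callable on a feature vector)."""
    return [classifier(vec) for vec, _ in samples]

def run_pipeline(raw_rows, numeric_columns, keep, classifier):
    """Stage 1 is the raw audit data itself; the rest are applied in order."""
    rows = preprocess(raw_rows)
    samples = extract_features(rows, numeric_columns)
    samples = select_features(samples, keep)
    return classify(samples, classifier)
```

Keeping each stage as a separate function makes it possible to swap the classifier while the earlier stages stay fixed, which is the setting in which the three algorithms are compared.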

2.2 Experiment

2.2.1 Dataset

The KDD Cup 1999 dataset [7] was derived from the 1998 DARPA Intrusion Detection Evaluation Program prepared and managed by the MIT Lincoln Laboratory. The dataset was a collection of simulated raw TCP dump data collected over a period of nine weeks. The simulated attacks were classified according to the actions and goals of the attacker. The dataset consists of one type of normal data and 22 different attack types categorized into 4 classes: Denial of Service (DoS), Probe, User-to-Root (U2R) and Remote-to-Local (R2L).

Denial of Service (DoS) attacks have the goal of limiting or denying services provided to the user, computer or network. A common tactic is to severely overload the targeted system. Probing or surveillance attacks have the goal of gaining knowledge of the existence or configuration of a computer system or network. Port scans or sweeps of a given IP address range typically fall into this category.

User-to-Root (U2R) attacks have the goal of gaining root or super-user access on a particular computer or system on which the attacker previously had user-level access. These are attempts by a non-privileged user to gain administrative privileges. A Remote-to-Local (R2L) attack is an attack in which a user sends packets to a machine to which the user does not have access, in order to expose the machine’s vulnerabilities and exploit privileges which a local user would have on the computer.

The attacks found in the labeled records are listed by class in Table 1.

Table 1. Attack classification

Class   Attacks
DoS     Neptune, Smurf, Pod, Teardrop, Land, back, mailbomb, processtable, udpstorm
Probe   portsweep, ipsweep, nmap, mscan
U2R     buffer_overflow, loadmodule, perl, rootkit, httptunnel, ps, sqlattack, xterm
R2L     guess_password, ftp_write, imap, multihop, named, phf, sendmail, snmpgetattack, snmpguess, spy, warezclient, warezmaster, worm, xlock, xsnoop
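In practice, the grouping in Table 1 is applied as a lookup from the attack name of a record to one of the four classes. The sketch below is one possible way to write that lookup. The attack names follow Table 1 in lowercase; the names `ATTACKS_BY_CLASS`, `CLASS_OF_ATTACK` and `attack_class`, the handling of a trailing period on labels and the "Unknown" fallback are assumptions made for illustration.

```python
# Attack names follow Table 1 (lowercased); the four classes are those of the paper.
ATTACKS_BY_CLASS = {
    "DoS": ["neptune", "smurf", "pod", "teardrop", "land", "back",
            "mailbomb", "processtable", "udpstorm"],
    "Probe": ["portsweep", "ipsweep", "nmap", "mscan"],
    "U2R": ["buffer_overflow", "loadmodule", "perl", "rootkit",
            "httptunnel", "ps", "sqlattack", "xterm"],
    "R2L": ["guess_password", "ftp_write", "imap", "multihop", "named", "phf",
            "sendmail", "snmpgetattack", "snmpguess", "spy", "warezclient",
            "warezmaster", "worm", "xlock", "xsnoop"],
}

# Invert the table: attack name -> class.
CLASS_OF_ATTACK = {name: cls
                   for cls, names in ATTACKS_BY_CLASS.items()
                   for name in names}

def attack_class(label):
    """Map a raw record label to Normal, DoS, Probe, U2R, R2L or Unknown."""
    # Assumption: labels may carry surrounding spaces, a trailing period or mixed case.
    name = label.strip().rstrip(".").lower()
    if name == "normal":
        return "Normal"
    return CLASS_OF_ATTACK.get(name, "Unknown")
```

Returning "Unknown" instead of raising an error lets records with labels outside Table 1 be counted separately rather than stopping the run.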

We use the 10% subset of the overall KDD Cup 1999 labeled dataset, which contains 494,020 records, each having 41 features. The distribution of connection types is given in Table 2.

Table 2. Distribution of connection types in the KDD Cup’99 training dataset

Class Number of instances Percentage of occurrence

Because the dataset is large, duplicate instances were removed, and then a random sample of 10% of the normal records and 10% of the Neptune attack records in the DoS class was selected, while all other records were retained.
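The reduction step described above can be sketched as follows. This is a minimal illustration and not the exact procedure used in our experiments: the function name `reduce_dataset`, the fixed random seed and the rounding rule for the sample size are assumptions.

```python
import random

def reduce_dataset(records, label_of, fractions, seed=0):
    """Drop exact duplicates, then keep a random fraction of selected labels.

    records   : iterable of hashable rows (e.g. tuples of feature values)
    label_of  : function returning the label of a row
    fractions : dict label -> fraction to keep; labels not listed are kept whole
    """
    rng = random.Random(seed)                # fixed seed for repeatable samples
    unique = list(dict.fromkeys(records))    # de-duplicate, preserving order
    by_label = {}
    for row in unique:
        by_label.setdefault(label_of(row), []).append(row)
    reduced = []
    for label, rows in by_label.items():
        fraction = fractions.get(label, 1.0)
        if fraction < 1.0:
            keep = max(1, round(len(rows) * fraction))
            rows = rng.sample(rows, keep)    # sample without replacement
        reduced.extend(rows)
    return reduced
```

For the setting described in the text, the call would pass `fractions={"normal": 0.1, "neptune": 0.1}`, so that every other attack type is kept in full.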

2.2.2 Feature selection

The features include the basic features of an individual TCP connection, such as the duration, the protocol type, the number of bytes transferred and the flag indicating the normal or error status of the connection. Other features of an individual connection were obtained using domain knowledge, and include the number of file creation operations and the number of failed login attempts. In total, there are 41 features, most of them taking on continuous values, as shown in Table 3.

Table 3. KDD Cup’99 features

No.  Feature              No.  Feature
11   num_failed_logins    32   dst_host_count
13   num_compromised      34   dst_host_same_srv_rate
14   root_shell           35   dst_host_diff_srv_rate
15   su_attempted         36   dst_host_same_src_port_rate
16   num_root             37   dst_host_srv_diff_host_rate
17   num_file_creations   38   dst_host_serror_rate
18   num_shells           39   dst_host_srv_serror_rate
19   num_access_files     40   dst_host_rerror_rate
20   num_outbound_cmds    41   dst_host_srv_rerror_rate
21   is_host_login

2.3 Results and discussions

The three techniques used to build the intrusion detection models, SVM using the Radial Kernel, Naïve Bayes and J48, were obtained from WEKA [8]. The Radial Kernel and Neural Kernel were selected for the SVM technique. We chose those settings to obtain the highest performance for those techniques. In our experiments, 10-fold cross validation was used to estimate the intrusion detection rates for the three techniques.
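For readers unfamiliar with the procedure, 10-fold cross validation splits the data into ten disjoint parts, trains on nine of them and tests on the remaining one, and repeats this so that each part serves as the test set once. The sketch below illustrates the idea in plain Python. It is not the WEKA implementation: the function names, the shuffling with a fixed seed and the `train_fn`/`predict_fn` interface are assumptions.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every index is tested exactly once."""
    if n_samples < k:
        raise ValueError("need at least k samples")
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]    # k disjoint, near-equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def cross_validate(X, y, train_fn, predict_fn, k=10, seed=0):
    """Return the mean accuracy over the k folds."""
    scores = []
    for train, test in k_fold_indices(len(X), k, seed):
        model = train_fn([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict_fn(model, X[i]) == y[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)
```

Because every record is used for testing exactly once, the averaged score does not depend on one particular train/test split.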

When comparing the accuracy of the multi-class classifier and the two-class classifier used with ID3 and Naïve Bayes, it can be seen that the two-class classifier has better results based on the accuracy criterion. Figure 2 indicates that the decision tree produces better accuracy for Probe, R2L and U2R compared to SVM and Naïve Bayes. For DoS, its accuracy is lower than that of SVM but higher than that of Naïve Bayes with a small dataset. Therefore, SVM is not suitable for such a small dataset. This finding is consistent with the study of Mohammadreza Ektefa et al. [9], which showed that the C4.5 algorithm performed better than SVM in detecting network intrusions and with regard to false alarms.

Figure 2. Comparing the accuracy of the three algorithms

Figure 3. Comparing the model building time of the three algorithms
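The per-class accuracy comparison discussed above can be computed as sketched here. This is a hedged illustration only: we assume that accuracy for a class means the fraction of records of that class that are predicted correctly, and that the two-class setting relabels every record as either one chosen class or "other". Both helper names are hypothetical.

```python
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    """Fraction of records of each true class that were predicted correctly."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: correct[cls] / n for cls, n in total.items()}

def to_two_class(labels, positive):
    """Relabel a multi-class problem as `positive` versus everything else."""
    return [positive if label == positive else "other" for label in labels]
```

Reporting accuracy per class rather than overall matters here because the classes are very unevenly sized, so a high overall score can hide poor results on U2R or R2L.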

In Figure 3, Naïve Bayes has the best training time, while the training time for SVM is much higher than for the others. Figure 4 shows that the testing time of decision trees is much better than that of the others; thus the use of decision tree classifiers for intrusion detection will enhance system performance significantly.

Figure 4. Comparing the model testing time of the three algorithms
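Training and testing times such as those compared above can be measured by wrapping the two phases with a wall-clock timer. The following sketch shows one way to do so. It is an assumption about the measurement, not the method used to produce Figures 3 and 4, and the `train_fn`/`predict_fn` interface is hypothetical.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def time_classifier(train_fn, predict_fn, X_train, y_train, X_test):
    """Measure model building time and testing time separately."""
    model, train_seconds = timed(train_fn, X_train, y_train)
    predictions, test_seconds = timed(
        lambda: [predict_fn(model, x) for x in X_test])
    return {"train_seconds": train_seconds,
            "test_seconds": test_seconds,
            "predictions": predictions}
```

Separating the two timings is useful because a classifier that is slow to train may still be fast enough at test time for online detection, and vice versa.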

2.4 Attack classification method based on combined classifiers

From the experimental results, we can provide an integrated model that selects an efficient algorithm for each specific type of attack. Observing the charts and tables, we can see that one classification model can give better results than the other models for a certain type of attack, so the best algorithm should be selected for each specific type of attack. Therefore, we assume that the IDS integrates several different classifiers and is able to run them in parallel on n processors at the same time, with each processor running one classification algorithm (classifier). The attack class of each new access to the system (new record) can then be selected by a voting algorithm over the classifiers. The algorithm is presented in Figure 5.

Input:
- New record: r
- n classification algorithms: CF1, ..., CFn
- Processors: P0, P1, ..., Pn
Output: C (class of the new record)
Begin
  For i = 1 to n, each Pi do in parallel
  Begin
    C[i] := CFi(r);
    Send(C[i], P0);
  End
  If (tid == 0) then P0 do
  Begin
    k := 1; Class[1] := C[1]; Count[1] := 1;
    For i = 2 to n
    Begin
      j := 1;
      While (j <= k and Class[j] != C[i]) j := j + 1;
      If (j <= k) Count[j] := Count[j] + 1;
      Else
      Begin
        k := k + 1; Class[k] := C[i]; Count[k] := 1;
      End
    End
    maxd := 0;
    For i = 1 to k
      If (maxd < Count[i])
      Begin
        maxd := Count[i]; C := Class[i];
      End
  End
  Output C;
End

Figure 5. Attack classification model based on combined classifiers
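The voting scheme of Figure 5 can be sketched in Python as follows. This is an illustrative sketch rather than our implementation: worker threads stand in for the processors P1, ..., Pn, each classifier is assumed to be a callable that maps a record to a class label, and ties are assumed to go to the class that was reported first.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    if not predictions:
        raise ValueError("no predictions to vote on")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label

def classify_by_vote(record, classifiers):
    """Run every classifier on the record concurrently, then take the vote."""
    if not classifiers:
        raise ValueError("at least one classifier is required")
    # Threads play the role of processors P1..Pn; the caller plays P0.
    with ThreadPoolExecutor(max_workers=len(classifiers)) as pool:
        predictions = list(pool.map(lambda clf: clf(record), classifiers))
    return vote(predictions)
```

For example, if three classifiers return "DoS", "Normal" and "DoS" for a record, `classify_by_vote` returns "DoS". With an even number of classifiers, ties become possible, so an odd number of classifiers or a weighting by per-class accuracy may be preferable.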

3 Conclusion

This paper proposed a new approach which combines different classifiers in order to make the best use of each classifier. To build the new model, we evaluated the accuracy and performance (training and testing time) of three classification algorithms: ID3, Naïve Bayes and SVM. Our experimental results using the KDDCup’99 IDS dataset based on the 10-fold cross validation test show that, for each particular type of attack, one of the classifiers functions best.

REFERENCES

[1] Christopher Kruegel, Fredrik Valeur, Giovanni Vigna and Richard A. Kemmerer, 2002. Stateful Intrusion Detection for High-Speed Networks. In IEEE Symposium on Security and Privacy, IEEE Computer Society Press, USA.

[2] Nguyen, H. & Choi, D., 2008. Application of data mining to network intrusion detection: classifier selection model. Springer-Verlag Berlin Heidelberg, pp. 399-408.

[3] Lu, C.-T., Boedihardjo, A.P., Manalwar, P., 2005. Exploiting efficient data mining techniques to enhance intrusion detection systems. Information Reuse and Integration, Conf, 2005. IRI-2005 IEEE International Conference, pp. 512-517.

[4] Henery R.J., 1994. Classification. Machine Learning, Neural and Statistical Classification.

[5] C. Xiang, M.Y. Chong, H.L. Zhu, 2004. Design of multiple-level tree classifiers for intrusion detection system. Cybernetics and Intelligent Systems, IEEE Conference, Vol. 2, pp. 873-878.

[6] Rung-Ching Chen, Kai-Fan Cheng, Ying-Hao Chen, Chia-Fen Hsieh, 2009. Using Rough Set and Support Vector Machine for Network Intrusion Detection System. Intelligent Information and Database Systems, ACIIDS 2009, First Asian Conference, pp. 465-470.

[7] KDD99: http://kdd.ics.uci.edu/databases/kddcup99/10 percent.gz

[8] WEKA: http://sourceforge.net/projects/weka/

[9] Mohammadreza Ektefa, Sara Memar, Fatimah Sidi, Lilly Suriani Affendey, 2010. Intrusion Detection Using Data Mining Techniques. Proc. of IEEE Intl. Conference on Information Retrieval & Knowledge Management, pp. 200-203.
