Using Null Data Processing to Recognize Variant Computer Viruses for Rule-based Anti-virus Systems


Abstract—Null data processing is commonly used in Knowledge Discovery in Databases systems and Machine Learning systems. In fact, null data must be eliminated from major databases because it often degrades the quality of data mining processing. In this paper, we introduce a new point of view: null data processing can also be used in a recognition system to identify strange objects. Our work involves three operations, ‘toNull’, ‘deNull’ and ‘fixNull’, along with the Data Fusion technique and a rule-based production inference algorithm. This process has been applied in the security domain to software we designed, called Machine Learning Anti-virus System (MAV). The experimental results showed that MAV performs comparably to other anti-virus software whose algorithms require much bigger virus signature databases.

Keywords— Data fusion, Machine learning, Knowledge based systems, Security

I. INTRODUCTION

Retrieving valuable information from the databases of Machine Learning (ML) systems and Knowledge Discovery in Databases (KDD) systems (also called target systems) depends on many factors, especially on the quality of the sample data set. Many data sources have no data in some fields of some records. Before applying the data mining process, target systems have to clean the input data. One of the techniques to clean data is to use a null data process to recover possible values. In this paper, we introduce a new point of view: null data processing can be used in a recognition system to identify strange objects. Our work consists of three steps. First, we set up a process called ‘toNull’, which empties all real null values and abnormal values of the objects’ attributes. Then, using a null data processing tool such as data fusion together with rule-based production inference, the process ‘deNull’ tries to estimate all possible values for these elements. Finally, a process named ‘fixNull’ recalls the original values of all modified elements and adds them to the knowledge base.

In the Experimentations section, we illustrate the results of this technique on a rule-based recognition target system: the Machine Learning Anti-virus System (MAV).

1 Cantho Inservice University, Vietnam (e-mail: tmnquang@ctu.edu.vn)

2 Vietnam National University HCM City (e-mail: hkiem@citd.edu.vn)

3 Hanoi University of Technology, Vietnam (e-mail: thuynt@it-hut.edu.vn)

II. NULL DATA PROCESSING

A. Data cleaning in the knowledge discovery process

The knowledge discovery process consists of the following phases: data selection, cleaning, enrichment, coding, data mining and reporting. Cleaning is an important phase because the quality of data directly affects the quality of data mining processing. The aim of this phase is to treat all data pollution. There are two types of data pollution: duplication of records and lack of domain consistency [1].

B. Null data processing in the cleaning phase

Besides several high-quality data sources, some data are collected from many sites and vary in quality. For many reasons, lack of domain consistency is a common problem that target systems have to face. This type of pollution is particularly damaging because it is hard to trace, yet it greatly influences the type of data patterns found [1].

C. Null data processing with Data Fusion

There are many existing null data processes. In our research, we rely on data fusion as a robust companion technique. Data fusion originates from market studies [2], [3], especially media and consumption surveys, where it is often impossible to ask the same sample all the items when there are too many questions.

In data fusion, the goal is to obtain a single database where all the variables have been completed for the union of units. Basically, the problem may be formalized in terms of two data files: the first file contains observations for a whole set of $p+q$ variables measured on $n_0$ units; the second file contains observations of only a subset of $p$ variables for $n_1$ units. In some cases, $n_0$ is small compared to $n_1$. If X stands for the common variables, we have the scheme in Fig. 1. The problem here is to fill in the blank part of the table, where many variables are missing because they have not been collected [4].

In Fig. 2, the file $(X_0, Y_0)$ is used to predict the unknown Y part of the second file. The first file is called the donor file and the second one the recipient file. In this approach, imputation with implicit models is based on the principle of copying and pasting: the whole vector of Y values of the donor is given to the Y variables of the receiver.


Fig. 1: Missing values in database


TABLE II

TWO OF THE DIAGNOSED OBJECTS IN OBSERVATION SPACE

No Object Name a0 a1 a2 a3 a4 a5 a6 a7 a8 a9

v ObjectV 15 28 03 101 38 27 65 37 90 61

x ObjectX 15 28 03 101 39 27 65 37 91 61

Let $i$ be a receiver. The basic idea is to look for a donor $j$ whose profile on the X variables is close to that of $i$: either a double, if all the variables are identical, or the nearest neighbour, for which an appropriate distance $d(i,j)$ in the space $\mathbb{R}^p$ of common variables is minimal [4].
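The donor search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `nearest_donor` and `impute` and the toy donor data are hypothetical, and the Euclidean distance is chosen here as one possible "appropriate distance" $d(i,j)$; the source does not prescribe a specific metric.

```python
import math

def nearest_donor(recipient_x, donors):
    """Return the donor (x, y) pair whose common variables x are closest
    to recipient_x in R^p (Euclidean distance is an assumption)."""
    return min(donors, key=lambda donor: math.dist(donor[0], recipient_x))

def impute(recipient_x, donors):
    """Copy-and-paste imputation: give the donor's whole Y vector to the recipient."""
    _, donor_y = nearest_donor(recipient_x, donors)
    return list(donor_y)

# Toy donor file (X_0, Y_0): each entry pairs common variables with observed Y values.
donors = [
    ((1.0, 2.0), (10, 20)),
    ((4.0, 6.0), (30, 40)),
]

near_first = impute((1.0, 2.5), donors)   # closest to the first donor
double = impute((4.0, 6.0), donors)       # identical X values: a "double"
```

A double is simply the special case in which the minimal distance is zero, so the same search covers both situations.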

III. RULE-BASED ANTI-VIRUS SYSTEMS

A. Rule-based Anti-virus systems

With the rapid development of the Internet, computer viruses have become hot news. They infect, destroy and steal data more frequently and more seriously, which directly affects the network security and data safety of many computer systems in the world. There are many types of computer viruses, and each type has its own way of infecting [5]. Basically, scanning for computer viruses is a recognition process over all characteristics of viral code stored in an ID-virus library. To diagnose computer viruses, a rule-based anti-virus program is built with the set $V_K$ of $K$ virus signature vectors:

$V_K = \{v_1, v_2, \dots, v_K\}$

and determines the existence of $v_i$ in the diagnosed set $S$.

The conventional rule set has the form:

$p_1 \wedge p_2 \wedge \dots \wedge p_n \rightarrow q$    (1)

where $p_i$ represents the virus signature/behavior and $q$ is a result/conclusion of the process.

When a new virus appears, anti-virus experts debug it and update the ID-virus library with the correct viral signature $v_i$. Using the information of the diagnosed object in the observation space and of the viral characteristics in the ID-virus library, the data classification algorithm of the anti-virus program assigns the diagnosed object to Class 1 or Class 2: possibly infected or not infected by a virus [6].

Suppose that an anti-virus program AV has an ID-virus library (Table I) and works on an observation space containing two objects of interest: ObjectV and ObjectX (Table II). The scanning result shows that ObjectV is infected by the virus Family.f.vir, while ObjectX is reported as safe.
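This scanning result can be reproduced with a minimal sketch of exact signature matching, in which every attribute of a signature must equal the corresponding attribute of the diagnosed object. The values come from Table I and Table II (the value written 03 in the tables is entered as 3); the names `ID_VIRUS_LIBRARY` and `scan` are hypothetical and not part of any real anti-virus product.

```python
# ID-virus library from Table I: virus name -> signature vector v0..v9.
ID_VIRUS_LIBRARY = {
    "Family.a.vir": [15, 28, 3, 101, 32, 27, 65, 37, 81, 61],
    "Family.b.vir": [15, 28, 3, 101, 35, 27, 65, 37, 85, 61],
    "Family.c.vir": [15, 28, 3, 101, 30, 27, 65, 37, 90, 61],
    "Family.d.vir": [15, 28, 3, 101, 34, 27, 65, 37, 84, 61],
    "Family.e.vir": [15, 28, 3, 101, 33, 27, 65, 37, 83, 61],
    "Family.f.vir": [15, 28, 3, 101, 38, 27, 65, 37, 90, 61],
    "Family.g.vir": [15, 28, 3, 101, 30, 27, 65, 37, 88, 61],
    "Family.h.vir": [15, 28, 3, 101, 29, 27, 65, 37, 87, 61],
    "Family.i.vir": [15, 28, 3, 101, 31, 27, 65, 37, 92, 61],
}

def scan(obj, library):
    """Return the name of the first signature whose attributes all equal
    those of obj (a conjunction of equality tests), or None if none match."""
    for name, signature in library.items():
        if len(signature) == len(obj) and all(p == a for p, a in zip(signature, obj)):
            return name
    return None

# The two diagnosed objects from Table II.
object_v = [15, 28, 3, 101, 38, 27, 65, 37, 90, 61]
object_x = [15, 28, 3, 101, 39, 27, 65, 37, 91, 61]

result_v = scan(object_v, ID_VIRUS_LIBRARY)  # matches Family.f.vir
result_x = scan(object_x, ID_VIRUS_LIBRARY)  # no exact match: reported as safe
```

ObjectX differs from Family.f.vir in only two attributes (a4 and a8), yet an exact-match rule reports it as safe.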

B. Machine Learning approach to anti-virus system

The advantage of the rule above is clear: the anti-virus program can identify most known viruses in the test data. However, the searching algorithm fails when the data sources are lacking (e.g. virus signatures missing from the ID-virus library) [7]. For example, virus V may change into X, with $X_N = \{x_1, x_2, \dots, x_n\}$. When we apply rule (1) to diagnose X, the result is evidently $\neg Q_V$, because there exists at least one $x_u$ with $x_u \neq v_u$ ($1 \le u \le n$). Therefore, most anti-virus programs cannot recognize variant viruses [8].

Our solution is to extend a conventional anti-virus program with an ML rule-based approach that involves these basic tasks: modeling a knowledge base, forming rule sets to recognize known viruses, diagnosing and discovering interesting rules using data mining algorithms, finding hidden attributes, and predicting variant/unknown viruses [6]. Like a typical KDD system, MAV has these stages:

- Examining: Data selection, Cleaning, Enrichment

- Diagnosing: Coding, Data mining

- Treatment: Data processing

- Conclusion: Reporting

The technique described in this paper is part of the first stage of our ML system: the data cleaning process.

Analyzing virus characteristics, we defined 5 virus classes corresponding to widespread viruses: File-viruses, Boot-viruses, Worm-viruses, Macro-viruses and Text-viruses. Depending on the diagnostic purpose, each class is defined with its particular characteristics. In general, a standard virus class has an object-oriented form:

Object: Virus family identification
Property: Attributes/behavior
Method: Treatment/direction outline
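One way to picture this object-oriented form is as a small record type. The sketch below is purely illustrative: the class name `VirusClass`, its field names and the placeholder treatment text are hypothetical and are not taken from MAV.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VirusClass:
    """Hypothetical rendering of the Object/Property/Method form."""
    family: str            # Object: virus family identification
    attributes: List[int]  # Property: attributes/behavior (signature vector)
    treatment: str         # text returned by the Method below

    def treatment_outline(self) -> str:
        """Method: return the treatment/direction outline for this family."""
        return f"{self.family}: {self.treatment}"

# Example instance using a signature vector of the kind shown in Table I.
example = VirusClass(
    family="Family.f.vir",
    attributes=[15, 28, 3, 101, 38, 27, 65, 37, 90, 61],
    treatment="placeholder treatment outline",
)
```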

DONOR FILE (I):      X_0 | Y_0
RECIPIENT FILE (J):  X_1 | ?

Fig. 2: Imputation scheme

TABLE I

EXAMPLE OF A VIRUS SIGNATURE DATABASE

No Virus Name v0 v1 v2 v3 v4 v5 v6 v7 v8 v9

1 Family.a.vir 15 28 03 101 32 27 65 37 81 61

2 Family.b.vir 15 28 03 101 35 27 65 37 85 61

3 Family.c.vir 15 28 03 101 30 27 65 37 90 61

4 Family.d.vir 15 28 03 101 34 27 65 37 84 61

5 Family.e.vir 15 28 03 101 33 27 65 37 83 61

6 Family.f.vir 15 28 03 101 38 27 65 37 90 61

7 Family.g.vir 15 28 03 101 30 27 65 37 88 61

8 Family.h.vir 15 28 03 101 29 27 65 37 87 61

9 Family.i.vir 15 28 03 101 31 27 65 37 92 61


[Figure: two charts; X-axis: code position (16 to 256), Y-axis: code value (0 to 256)]

(a) Executable code sequence of virus Klez.a.worm.W32

(b) Executable code sequence of virus Klez.h.worm.W32

Fig. 3: The graphical executable code of virus family Klez.worm.W32

Using this model, we suggest three knowledge bases:

- KB1: examine determined diseases
- KB2: diagnose new determined diseases
- KB3: diagnose unusual diseases

KB1 is used to treat most known viruses, KB2 is used to detect some variants of known viruses, and KB3 is used to predict unknown viruses. With this division, MAV also requires decisions at different usage levels: end-users, technicians and system administrators [6].

C. Cleaning data to recognize variant computer viruses

Most anti-virus programs are ineffective when facing variant viruses. In fact, virus variants are programmed by crackers from their parent’s source code. Therefore, the target codes of viruses in the same family are often similar. In Fig. 3, the X-axis denotes the position of each executable code and the Y-axis denotes its corresponding value (0-255). Chart 3a shows the graphical executable code of virus Klez.a.worm.W32, and chart 3b shows the code of virus Klez.h.worm.W32, which is the seventh descendant of Klez.a.worm.W32.

If the number of attributes $x_i$ of X that vary from V is small compared with the number of attributes $v_i$ of V, the recognition rule to identify the variant computer viruses can be defined as follows:

$R_X: a_1 \wedge a_2 \wedge \dots \wedge (a_u \leftarrow \mathrm{NULL}) \wedge \dots \wedge a_n \rightarrow Q_X$

where:
$a_i$ denotes the values of virus V;
$a_u$ denotes the values that are abnormal compared to V;
$Q_X$ denotes the conclusion of the inference process, asserting whether X is a descendant of V.

The process to recognize variant computer viruses involves three operations:

1) toNull: create a data backup and then empty all abnormal values of the objects’ attributes to isolate all new viral behaviors (Fig. 4). After this operation, there is a ‘virtual null’ data set (null, but not truly null).
2) deNull: use data fusion to replace all null data with possible values. The goal of the deNull operation is to predict all strange objects that may be a nearest neighbour of known viruses (Fig. 5).
3) fixNull: recall all original values from the data backup (Fig. 6) and add them to the database in order to recognize ‘variant of variant’ computer viruses in the future (Table III).
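The three operations can be sketched as follows, using the signature database of Table I and the ObjectX of Table II. This is a minimal sketch under explicit assumptions rather than the MAV implementation: a value is treated as abnormal when it never occurs in the same attribute column of the database, the donor is the known signature closest (Manhattan distance, an assumed metric) to the backed-up values on the nulled attributes, and no rejection threshold is applied. The function names `to_null`, `de_null` and `fix_null` are hypothetical.

```python
# Signature database from Table I (03 in the table is entered as 3).
SIGNATURES = {
    "Family.a.vir": [15, 28, 3, 101, 32, 27, 65, 37, 81, 61],
    "Family.b.vir": [15, 28, 3, 101, 35, 27, 65, 37, 85, 61],
    "Family.c.vir": [15, 28, 3, 101, 30, 27, 65, 37, 90, 61],
    "Family.d.vir": [15, 28, 3, 101, 34, 27, 65, 37, 84, 61],
    "Family.e.vir": [15, 28, 3, 101, 33, 27, 65, 37, 83, 61],
    "Family.f.vir": [15, 28, 3, 101, 38, 27, 65, 37, 90, 61],
    "Family.g.vir": [15, 28, 3, 101, 30, 27, 65, 37, 88, 61],
    "Family.h.vir": [15, 28, 3, 101, 29, 27, 65, 37, 87, 61],
    "Family.i.vir": [15, 28, 3, 101, 31, 27, 65, 37, 92, 61],
}

def to_null(obj, signatures):
    """Back up obj, then replace with None every value that never occurs
    in the same column of the database (assumed meaning of 'abnormal')."""
    backup = list(obj)
    columns = list(zip(*signatures.values()))
    nulled = [v if v in columns[i] else None for i, v in enumerate(obj)]
    return nulled, backup

def de_null(nulled, backup, signatures):
    """Fill the nulls from the nearest donor signature.

    Donors must agree with obj on every non-null attribute; among them the
    one closest to the backed-up values on the nulled attributes is chosen."""
    holes = [i for i, v in enumerate(nulled) if v is None]
    candidates = {
        name: sig
        for name, sig in signatures.items()
        if all(v is None or sig[i] == v for i, v in enumerate(nulled))
    }
    if not candidates:
        return None, None
    donor = min(
        candidates,
        key=lambda name: sum(abs(candidates[name][i] - backup[i]) for i in holes),
    )
    filled = [candidates[donor][i] if v is None else v for i, v in enumerate(nulled)]
    return donor, filled

def fix_null(backup, signatures, new_name):
    """Recall the original values and store them as a new signature."""
    signatures[new_name] = list(backup)

object_x = [15, 28, 3, 101, 39, 27, 65, 37, 91, 61]
nulled, backup = to_null(object_x, SIGNATURES)
donor, filled = de_null(nulled, backup, SIGNATURES)
if donor is not None:
    fix_null(backup, SIGNATURES, "Family.j.vir")
```

On ObjectX, this sketch nulls attributes a4 and a8, selects Family.f.vir as the donor, and stores the original vector as a tenth entry, in line with Figs. 4 to 6 and Table III. Without a distance threshold, any object that matches the constant attributes would be attributed to some family, so a real system would need an additional acceptance criterion.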

IV. EXPERIMENTATIONS

A. Practical processing

To estimate the effect of the method above, we experimented with Norton Anti-virus 2003 Professional Edition, Virus Scan Professional Edition, Bit Defender v.8 and MAV, all of which have a smart scanning ability (Table IV).

The system activities are described as follows:

• Merging the dataset of virus samples randomly into the observation space

• Installing 4 anti-virus programs into the testing system

• For each tested anti-virus program:

- To disable the auto-protect agents

- To enable the smart scanning option at the highest level

(a) ObjectX before nulling: 15 28 03 101 39 27 65 37 91 61

(b) ObjectX after nulling: 15 28 03 101 ? 27 65 37 ? 61

Fig. 4: Applying toNull to a diagnosed object

1 Family.a.vir 15 28 03 101 32 27 65 37 81 61

2 Family.b.vir 15 28 03 101 35 27 65 37 85 61

3 Family.c.vir 15 28 03 101 30 27 65 37 90 61

4 Family.d.vir 15 28 03 101 34 27 65 37 84 61

5 Family.e.vir 15 28 03 101 33 27 65 37 83 61

6 Family.f.vir 15 28 03 101 38 27 65 37 90 61

7 Family.g.vir 15 28 03 101 30 27 65 37 88 61

8 Family.h.vir 15 28 03 101 29 27 65 37 87 61

9 Family.i.vir 15 28 03 101 31 27 65 37 92 61

Nearest neighbour (NN): 38    NN: 90

(a) Using the nearest neighbour to estimate possible values

ObjectX*: 15 28 03 101 38 27 65 37 90 61

(b) A variant of virus Family.f.vir detected

Fig. 5: deNulling all abnormal values

New variant virus: 15 28 03 101 39 27 65 37 91 61

Fig 6: Recalling original object values


Fig. 7: Comparison of smart anti-virus scans

- To set up the program for virus detection only

- To update to the latest virus signature database from the program’s website

- To scan the observation space for viruses

- To record the results of each tested program

B. Experimentation Results

Table V shows the diagnostic results on an 80,000 KB dataset of 1,000 virus samples supplied by Kaspersky Anti-virus (Version 5.0.5/Release build #13, compiled on Nov 29 2004).

Fig. 7 charts the anti-virus test results. With respect to detection, MAV and Bit Defender (BitDef) have nearly the same results, at 957 and 959 viruses detected respectively. Both are better than Norton Anti-virus (NAV) and Virus Scan, at 907 and 906.

Although its number of updated virus signatures is very small (890), MAV can recognize as many computer viruses as the other anti-virus programs, which have much bigger virus signature databases (72,020 and 253,993 virus signatures).
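As a worked reading of these figures, the short calculation below converts the reported counts into detection rates over the 1,000 samples and compares signature database sizes. It is only an arithmetic illustration: the pairing of counts with products follows the order in which they are named in the text, which is an assumption, and the signature counts for Norton Anti-virus and Bit Defender are those listed in Table IV.

```python
SAMPLES = 1000  # virus samples in the test dataset

# Viruses detected, paired with products in the order given in the text (assumed).
detected = {"MAV": 957, "BitDef": 959, "NAV": 907, "Virus Scan": 906}

# Signature database sizes: MAV from the text, the others from Table IV.
signatures = {"MAV": 890, "NAV": 72020, "BitDef": 253993}

def detection_rate(count, total=SAMPLES):
    """Percentage of the samples that were detected."""
    return 100 * count / total

rates = {name: detection_rate(count) for name, count in detected.items()}

# How many times larger each database is than MAV's.
size_ratio = {
    name: size / signatures["MAV"]
    for name, size in signatures.items()
    if name != "MAV"
}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f}% detected")
for name, ratio in size_ratio.items():
    print(f"{name} signature database is about {ratio:.0f} times the size of MAV's")
```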

V. CONCLUSION

By treating variant computer viruses as objects lacking domain consistency, MAV has successfully recognized variant viruses using the three-step null data process powered by the Data Fusion technique.

Although this research was carried out on a rule-based anti-virus system, the approach can also be applied to other rule-based recognition systems in which strange objects must be learned and identified.

Because it is based on the nearest neighbour method, however, this technique has some limitations. MAV can only recognize the nearest variants of known computer viruses. When facing an unknown virus, MAV needs an additional decision from a built-in expert system.

REFERENCES

[1] Pieter Adriaans, Dolf Zantinge, “Data Mining”, Addison Wesley Longman, 1996, 40-41.

[2] Baker K., Harris P., O’Brien J., “Data fusion: An appraisal and experimental evaluation”, Journal of the Market Research Society, 31 (1989), 153-212.

[3] Lejeune M., “De l'usage des fusions de données dans les études de marché”, Proceedings 50th Session of ISI-Beijing, 1995, Tome LVI, 923-935.

[4] Gilbert Saporta, “Data fusion and data grafting”, CNAM, F75141 Paris Cedex 03, France. Elsevier Science B.V., 2002.

[5] Hoang Kiem, Nguyen Thanh Thuy, Truong Minh Nhat Quang, “Machine Learning Approach to Anti-virus Expert System with Nearest Neighbor Rule-based Structural Risk Minimization”, RIVF’05, the 3rd International Conference in Computer Science: Research, Innovation and Vision for the Future, February 2005, Cantho-Vietnam, 295-298.

[6] Hoang Kiem, Nguyen Thanh Thuy, Truong Minh Nhat Quang, “A Machine Learning Approach to Anti-virus System”, Joint Workshop of Vietnamese Society of AI, SIGKBS-JSAI, ICS-IPSJ and IEICE-SIGAI on Active Mining, 4-7 December 2004, Hanoi-Vietnam, 61-65.

[7] Nguyen Thanh Thuy, Truong Minh Nhat Quang, “A Global Solution to Anti-virus Systems”, The Proceedings of the 1st International Conference on Advanced Communication Technology, 10-12 February 1999, Muju-Korea, 374-377.

[8] Nguyen Thanh Thuy, Truong Minh Nhat Quang, “Expert System Approach to Diagnosing and Destroying Unknown Computer Viruses”, The Scientific Conference Proceedings of the 5th ASEAN Science and Technology Week, 10-1998, Hanoi-Vietnam.

TABLE III

VIRUS SIGNATURE DATABASE AFTER FIXNULL

1 Family.a.vir 15 28 03 101 32 27 65 37 81 61

2 Family.b.vir 15 28 03 101 35 27 65 37 85 61

3 Family.c.vir 15 28 03 101 30 27 65 37 90 61

4 Family.d.vir 15 28 03 101 34 27 65 37 84 61

5 Family.e.vir 15 28 03 101 33 27 65 37 83 61

6 Family.f.vir 15 28 03 101 38 27 65 37 90 61

7 Family.g.vir 15 28 03 101 30 27 65 37 88 61

8 Family.h.vir 15 28 03 101 29 27 65 37 87 61

9 Family.i.vir 15 28 03 101 31 27 65 37 92 61

10 Family.j.vir 15 28 03 101 39 27 65 37 91 61

TABLE V

EXPERIMENTAL RESULTS

Anti-virus    Detection    Precision    Prediction    Omission

TABLE IV

TESTING ANTI-VIRUSES

Anti-virus          Manufacturer   Virus Definition   Virus Signatures   Engine Version
Norton Anti-virus   Symantec       1/25/2006          72,020             9.05.15
Virus Scan          McAfee         1/25/2006          N/A                4.0.4682
Bit Defender v.8    SoftWin        1/25/2006          253,993            7.05450
