SPAM EMAIL FILTERING BASED ON MACHINE LEARNING
Trinh Minh Duc *
College of Information and Communication Technology – TNU
SUMMARY
In this paper, we present a spam email filtering method based on machine learning, namely the Naïve Bayes classification method, because this approach is highly effective. With its learning ability (self-improving performance), a system applying this method can automatically learn and improve its spam email classification. At the same time, the system's classification ability is continuously updated by new incoming emails; therefore, it is much harder for spammers to defeat the classifier than with traditional solutions.
Key words: Machine learning, email spam filtering, Naïve Bayes
INTRODUCTION*
Email classification is essentially a two-class text classification problem: the initial dataset consists of spam and non-spam emails, and the texts to be classified are the emails arriving in the inbox. The output of the classification process is the class label assigned to an email, i.e., one of the two classes: spam or non-spam.
The general model of the spam email classification problem can be described as follows:
Figure 1. The spam email classification model
* Tel: 0984215060; Email: duchoak15@gmail.com
The categorization process can be divided into two phases:
The training phase: the input of this phase is the set of spam and non-spam emails; the output is the trained data, produced by applying a suitable classification method, which serves the classification phase.
The classification phase: the input of this phase is an email together with the trained data; the output is the classification result for that email: spam or non-spam.
The rest of this paper is organized as follows. In Sect. 2, we formulate the Naïve Bayes classification method and our solution. In Sect. 3, we show experimental results to evaluate the efficiency of this method. Finally, in Sect. 4, we conclude by pointing out possible future directions.
METHOD [4]
The Naïve Bayes method is a supervised learning classification method based on a probability model. The classification process is based on the probability values of the likelihood of the hypotheses. The Naïve Bayes classification technique is based on Bayes' theorem and is particularly suitable for problems whose input dimensionality is large. Although Naïve Bayes is quite simple, its classification capability is often comparable to that of far more complex methods. To relax the statistical dependence between attributes, the Naïve Bayes method considers the attributes conditionally independent of each other.
Classification problem
Given a training set D, where each training instance x is represented as an n-dimensional attribute vector x = (x_1, x_2, ..., x_n), and a pre-defined set of classes C = {c_1, c_2, ..., c_m}: given a new instance z, which class should z be classified into?
The probability P(c_i | z), i.e., the probability that a new instance z belongs to class c_i, is calculated as follows:
$$c = \arg\max_{c_i \in C} P(c_i \mid z)$$

$$= \arg\max_{c_i \in C} P(c_i \mid z_1, z_2, \ldots, z_n)$$

$$= \arg\max_{c_i \in C} \frac{P(z_1, z_2, \ldots, z_n \mid c_i)\, P(c_i)}{P(z_1, z_2, \ldots, z_n)}$$

$$= \arg\max_{c_i \in C} P(z_1, z_2, \ldots, z_n \mid c_i)\, P(c_i)$$

(P(z_1, z_2, ..., z_n) is the same for all classes.)
Assumption in the Naïve Bayes method: the attributes are conditionally independent given the class:

$$P(z_1, z_2, \ldots, z_n \mid c_i) = \prod_{j=1}^{n} P(z_j \mid c_i)$$
The Naïve Bayes classifier finds the most probable class for z:

$$c_{NB} = \arg\max_{c_i \in C} P(c_i) \prod_{j=1}^{n} P(z_j \mid c_i)$$
The Naïve Bayes classification algorithm can be described succinctly as follows:

The learning phase (given a training set): for each class label c_i ∈ C, estimate the prior probability P(c_i); for each attribute value x_j, estimate the probability of that attribute value given class c_i, i.e., P(x_j | c_i).

The classification phase (given a new instance): for each class c_i ∈ C, compute the formula

$$P(c_i) \prod_{j=1}^{n} P(x_j \mid c_i)$$

and select the most probable class c*:

$$c^{*} = \arg\max_{c_i \in C} P(c_i) \prod_{j=1}^{n} P(x_j \mid c_i)$$
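As an illustration, the two phases above can be sketched in Python. This is a minimal sketch under assumed conditions (word-level attributes, Laplace smoothing, log-domain scores), not the implementation used in our experiments; all function names are ours.

```python
from collections import Counter, defaultdict
import math

def train(emails, labels):
    """Learning phase: estimate the prior P(c) and the counts needed for P(x_j | c)."""
    class_counts = Counter(labels)          # n(c)
    word_counts = defaultdict(Counter)      # n(c, x)
    vocab = set()
    for words, c in zip(emails, labels):
        word_counts[c].update(words)
        vocab.update(words)
    priors = {c: class_counts[c] / len(labels) for c in class_counts}
    return priors, word_counts, vocab

def classify(words, priors, word_counts, vocab):
    """Classification phase: return argmax_c [log P(c) + sum_j log P(x_j | c)]."""
    best_class, best_score = None, float("-inf")
    for c in priors:
        total = sum(word_counts[c].values())
        # Laplace smoothing corresponds to m = |vocab| and p = 1/|vocab|
        score = math.log(priors[c]) + sum(
            math.log((word_counts[c][w] + 1) / (total + len(vocab))) for w in words)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy usage with four training emails represented as word lists:
emails = [["cheap", "pills"], ["meeting", "agenda"],
          ["cheap", "meds"], ["lunch", "agenda"]]
labels = ["spam", "non-spam", "spam", "non-spam"]
model = train(emails, labels)
print(classify(["cheap", "meds"], *model))      # spam
```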
There are two issues we need to solve:

1. What happens if no training instance associated with class c_i has attribute value x_j? In that case P(x_j | c_i) = 0, and hence

$$P(c_i) \prod_{j=1}^{n} P(x_j \mid c_i) = 0$$

Solution: use a Bayesian approach (the m-estimate) to estimate P(x_j | c_i):

$$P(x_j \mid c_i) = \frac{n(c_i, x_j) + m\,p}{n(c_i) + m}$$
where:
n(c_i) is the number of training instances associated with class c_i;
n(c_i, x_j) is the number of training instances associated with class c_i that have attribute value x_j;
p is a prior estimate for P(x_j | c_i); assuming a uniform prior, p = 1/k if the attribute X_j has k possible values;
m is a weight given to the prior: the n(c_i) actual observations are augmented by m additional virtual samples distributed according to p.
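A small numeric sketch of this m-estimate (the function name and the example numbers are ours, for illustration only):

```python
def m_estimate(n_ci_xj, n_ci, k, m):
    """P(x_j | c_i) = (n(c_i, x_j) + m*p) / (n(c_i) + m), with uniform prior p = 1/k."""
    p = 1.0 / k
    return (n_ci_xj + m * p) / (n_ci + m)

# An attribute value never observed with class c_i no longer gets probability 0:
print(m_estimate(0, 100, k=2, m=2))    # 1/102, instead of 0/100 = 0
```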
2. The limited precision of computer arithmetic: P(x_j | c_i) < 1 for every attribute value x_j and class c_i, so when the number of attribute values is very large, the product of the probabilities underflows toward zero. Solution: compute with the logarithm of the probability:

$$c^{*} = \arg\max_{c_i \in C} \left( \log P(c_i) + \sum_{j=1}^{n} \log P(x_j \mid c_i) \right)$$
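The underflow, and the effect of moving to logarithms, can be demonstrated with a short sketch (the numbers and the helper name are illustrative, not from our experiments):

```python
import math

# Multiplying many probabilities < 1 underflows standard floating point ...
product = 1.0
for _ in range(1000):
    product *= 0.01
print(product)                # 0.0: the true value 10**-2000 is not representable

# ... while the equivalent log-sum remains perfectly representable:
log_sum = 1000 * math.log(0.01)
print(log_sum)                # about -4605.17

def classify_log(priors, cond_probs):
    """c* = argmax_c [log P(c) + sum_j log P(x_j | c)]."""
    return max(priors, key=lambda c: math.log(priors[c])
               + sum(math.log(p) for p in cond_probs[c]))
```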
In the spam email classification problem, each sample that we consider is an email, and the set of classes an email can belong to is C = {spam, non-spam}.

When we receive an email, if we know nothing about it, it is hard to decide whether the email is spam or non-spam. The more characteristics or attributes of the email we know, the more we can improve the efficiency of the classification. An email has many characteristics, such as its title, its content, and whether or not it carries an attachment. A simple example: if we know that 95% of HTML emails are spam and we receive an HTML email, we can use this prior knowledge to compute the probability that the received email is spam. If this probability is greater than the probability of non-spam, we can conclude that the email is spam, although this conclusion is not fully reliable on its own; the more information we know, the greater the probability of a correct classification.

To obtain the prior probabilities, we use the Naïve Bayes method to train on the set of template emails, and then use these probabilities to classify a new email. The probability calculation is based on the Naïve Bayes formula. With the obtained probability values, we compare them with each other: if the spam probability is greater than the non-spam probability, we conclude that the email is spam; otherwise it is non-spam [5].
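The HTML example above can be worked through Bayes' theorem numerically. The two conditional likelihoods below are invented purely for illustration; only the prior P(spam) = 0.394 matches the sample dataset described in the next section.

```python
p_spam = 0.394                # prior P(spam), the spam proportion of the dataset
p_html_given_spam = 0.70      # assumed P(HTML | spam)
p_html_given_ham = 0.10       # assumed P(HTML | non-spam)

# Bayes' theorem: P(spam | HTML) = P(HTML | spam) P(spam) / P(HTML)
p_html = p_html_given_spam * p_spam + p_html_given_ham * (1 - p_spam)
p_spam_given_html = p_html_given_spam * p_spam / p_html
print(round(p_spam_given_html, 2))   # 0.82: greater than 0.5, so label the email spam
```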
EXPERIMENTAL RESULTS
We have implemented a test which applies the Naïve Bayes method to email classification. The total number of emails in the sample dataset is 4601, including 1813 spam emails (39.4%). This dataset, called Spambase, can be downloaded at http://archive.ics.uci.edu/ml/datasets/Spambase.

The dataset is divided into two disjoint subsets: the training set D_train (66.7%), used for training the system, and the test set D_test (33.3%), used for evaluating the trained system.
In order to evaluate a machine learning system's performance, we often use measures such as Precision (P), Recall (R), Accuracy rate (Acc), Error rate (Err), and the F1-measure. These measures are computed as follows:

$$P = \frac{n_{S \to S}}{n_{S \to S} + n_{N \to S}}$$

$$R = \frac{n_{S \to S}}{n_{S \to S} + n_{S \to N}}$$

$$Acc = \frac{n_{S \to S} + n_{N \to N}}{N_S + N_N}$$

$$Err = \frac{n_{S \to N} + n_{N \to S}}{N_S + N_N}$$

$$F_1 = \frac{2PR}{P + R}$$
where:
n_{S→S} is the number of spam emails which the filter recognizes as spam;
n_{S→N} is the number of spam emails which the filter recognizes as non-spam;
n_{N→S} is the number of non-spam emails which the filter recognizes as spam;
n_{N→N} is the number of non-spam emails which the filter recognizes as non-spam;
N_S is the total number of spam emails;
N_N is the total number of non-spam emails.
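These measures translate directly into code. As a check, the sketch below (a helper of our own) recomputes the F1-measure from the Experiment 1 counts reported in Table 1:

```python
def metrics(n_ss, n_sn, n_ns, n_nn):
    """Precision, Recall, Accuracy, Error, and F1 from the four confusion counts."""
    total = n_ss + n_sn + n_ns + n_nn           # N_S + N_N
    precision = n_ss / (n_ss + n_ns)
    recall = n_ss / (n_ss + n_sn)
    acc = (n_ss + n_nn) / total
    err = (n_sn + n_ns) / total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, acc, err, f1

# Experiment 1 counts from Table 1:
p, r, acc, err, f1 = metrics(486, 119, 204, 726)
print(round(100 * f1, 2))    # 75.06 (the paper reports 75.05%, the same value truncated)
```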
Experimental results on the Spambase dataset
We present test results for two ways of dividing the Spambase dataset:

Experiment 1: divide the original Spambase dataset with proportion k1 = 2/3, i.e., 2/3 of the dataset is used for training and the remainder for testing.

Experiment 2: divide the original Spambase dataset with proportion k2 = 9/10, i.e., 9/10 of the dataset is used for training and the remainder for testing.
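The split itself can be sketched as follows; the helper is hypothetical (the paper does not state whether the split was random), but the resulting subset sizes match the proportions above:

```python
import random

def split(dataset, train_fraction):
    """Divide a dataset into two disjoint subsets: training and test."""
    data = dataset[:]
    random.shuffle(data)
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]

# Experiment 1: k1 = 2/3 of the 4601 Spambase emails used for training
train_set, test_set = split(list(range(4601)), 2 / 3)
print(len(train_set), len(test_set))    # 3067 1534
```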
Table 1. Testing results

              Experiment 1   Experiment 2
n_{S→S}       486            180
n_{S→N}       119            2
n_{N→N}       726            276
n_{N→S}       204            3
F1-measure    75.05%         98.63%

The testing results in Experiment 2 have very high accuracy (approximately 99%).
CONCLUSION
In this paper, we have examined the effectiveness of the Naïve Bayes classification method. It is a classifier with a self-learning ability that improves classification performance, and it proved suitable for the email classification problem. Currently, we are continuing to build standard training samples and to adjust the Naïve Bayes algorithm to improve classification accuracy. In the near future, we will build a standard training and testing data system for both English and Vietnamese. This is a large problem and needs more concentrated effort.
REFERENCES
[1] Jonathan A. Zdziarski, Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification, No Starch Press, 2005.
[2] Mehran Sahami, Susan Dumais, David Heckerman, Eric Horvitz, "A Bayesian Approach to Filtering Junk E-Mail", 1998.
[3] Sun Microsystems, JavaMail API Design Specification, Version 1.4.
[4] T. M. Mitchell, Machine Learning, McGraw-Hill, 1997.
[5] Lê Nguyễn Bá Duy, Tìm hiểu các hướng tiếp cận phân loại email và xây dựng phần mềm, Tp. Hồ Chí Minh, 2005.
ABSTRACT
SPAM EMAIL FILTERING BASED ON MACHINE LEARNING
Trần Minh Đức *
University of Information and Communication Technology – Thai Nguyen University

In this paper, I introduce a spam email filtering method based on machine learning, because this approach is highly effective. With its learning ability (self-improving performance), the system can automatically learn and improve its spam email classification. At the same time, the classification ability of the system is continuously updated with new spam samples, so it is very difficult for spammers to defeat it, compared with other traditional approaches.
Keywords: machine learning, spam filtering, Naïve Bayes.

Received: 13/3/2014; Reviewed: 15/3/2014; Accepted for publication: 25/3/2014
Scientific reviewer: Dr. Trương Hà Hải – University of Information and Communication Technology – Thai Nguyen University