SPAM EMAIL FILTERING BASED ON MACHINE LEARNING
Trinh Minh Duc *
College of Information and Communication Technology – TNU
SUMMARY
In this paper, we present a spam email filtering method based on machine learning, namely the Naïve Bayes classification method, because this approach is highly effective. With its learning ability (self-improving performance), a system applying this method can automatically learn and improve its spam email classification. At the same time, the system's classification ability is continuously updated by new incoming emails; therefore, it is much harder for spammers to defeat the classifier than with traditional solutions.
Key words: Machine learning, email spam filtering, Naïve Bayes
INTRODUCTION*
Email classification is essentially a two-class text classification problem: the initial dataset consists of spam and non-spam emails, and the texts to be classified are the emails arriving in the inbox. The output of the classification process is the class label assigned to an email, i.e., one of the two classes: spam or non-spam.
The general model of the spam email classification problem can be described as follows:
Figure 1. The spam email classification model
* Tel: 0984215060; Email: duchoak15@gmail.com
The categorization process can be divided into two phases:
The training phase: the input of this phase is the set of spam and non-spam emails; the output is the trained data, produced by applying a suitable classification method, which serves the classification phase.
The classification phase: the input of this phase is an email together with the trained data; the output is the classification result for that email: spam or non-spam.
The rest of this paper is organized as follows. In Sect. 2, we formulate the Naïve Bayes classification method and our solution. In Sect. 3, we show experimental results to evaluate the efficiency of this method. Finally, in Sect. 4, we conclude by pointing out possible future directions.
METHOD [4]
The Naïve Bayes method is a supervised learning classification method based on a probability model. The classification process is based on the probability values of the likelihood of the hypotheses. The Naïve Bayes classification technique is based on Bayes' theorem and is particularly suitable for problems whose input dimensionality is large. Although Naïve Bayes is quite simple, its classification capability is often comparable to that of far more complex methods. To relax the statistical dependence between attributes, the Naïve Bayes method considers the attributes conditionally independent of each other.
Classification problem
Given a training set D, where each training instance x is represented as an n-dimensional attribute vector x = (x_1, x_2, ..., x_n), and a pre-defined set of classes C = {c_1, c_2, ..., c_m}: given a new instance z, which class should z be classified into?
The probability P(c_i | z), i.e., the probability that a new instance z belongs to class c_i, is calculated as follows:
$$c = \arg\max_{c_i \in C} P(c_i \mid z)$$

$$= \arg\max_{c_i \in C} P(c_i \mid z_1, z_2, \ldots, z_n)$$

$$= \arg\max_{c_i \in C} \frac{P(z_1, z_2, \ldots, z_n \mid c_i)\, P(c_i)}{P(z_1, z_2, \ldots, z_n)}$$

$$= \arg\max_{c_i \in C} P(z_1, z_2, \ldots, z_n \mid c_i)\, P(c_i)$$

(P(z_1, z_2, ..., z_n) is the same for all classes.)
Assumption in the Naïve Bayes method: the attributes are conditionally independent given the class:

$$P(z_1, z_2, \ldots, z_n \mid c_i) = \prod_{j=1}^{n} P(z_j \mid c_i)$$
The Naïve Bayes classifier finds the most probable class for z:

$$c_{NB} = \arg\max_{c_i \in C} P(c_i) \prod_{j=1}^{n} P(z_j \mid c_i)$$
The Naïve Bayes classification algorithm can be described succinctly as follows:

The learning phase (given a training set): for each class label c_i ∈ C, estimate the prior probability P(c_i); for each attribute value x_j, estimate the probability of that attribute value given class c_i, i.e., P(x_j | c_i).

The classification phase (given a new instance): for each class c_i ∈ C, compute the formula

$$P(c_i) \prod_{j=1}^{n} P(x_j \mid c_i)$$

and select the most probable class c*:

$$c^{*} = \arg\max_{c_i \in C} P(c_i) \prod_{j=1}^{n} P(x_j \mid c_i)$$
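As an illustration, the two phases above can be sketched in Python. This is a minimal sketch under assumed conditions (word-level attributes, Laplace smoothing, log-domain scores), not the implementation used in our experiments; all function names are ours.

```python
from collections import Counter, defaultdict
import math

def train(emails, labels):
    """Learning phase: estimate the prior P(c) and the counts needed for P(x_j | c)."""
    class_counts = Counter(labels)          # n(c)
    word_counts = defaultdict(Counter)      # n(c, x)
    vocab = set()
    for words, c in zip(emails, labels):
        word_counts[c].update(words)
        vocab.update(words)
    priors = {c: class_counts[c] / len(labels) for c in class_counts}
    return priors, word_counts, vocab

def classify(words, priors, word_counts, vocab):
    """Classification phase: return argmax_c [log P(c) + sum_j log P(x_j | c)]."""
    best_class, best_score = None, float("-inf")
    for c in priors:
        total = sum(word_counts[c].values())
        # Laplace smoothing corresponds to m = |vocab| and p = 1/|vocab|
        score = math.log(priors[c]) + sum(
            math.log((word_counts[c][w] + 1) / (total + len(vocab))) for w in words)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy usage with four training emails represented as word lists:
emails = [["cheap", "pills"], ["meeting", "agenda"],
          ["cheap", "meds"], ["lunch", "agenda"]]
labels = ["spam", "non-spam", "spam", "non-spam"]
model = train(emails, labels)
print(classify(["cheap", "meds"], *model))      # spam
```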
There are two issues we need to solve:

1. What happens if no training instance associated with class c_i has attribute value x_j? In that case P(x_j | c_i) = 0, and hence

$$P(c_i) \prod_{j=1}^{n} P(x_j \mid c_i) = 0$$

Solution: use a Bayesian approach (the m-estimate) to estimate P(x_j | c_i):

$$P(x_j \mid c_i) = \frac{n(c_i, x_j) + m\,p}{n(c_i) + m}$$
where:
n(c_i) is the number of training instances associated with class c_i;
n(c_i, x_j) is the number of training instances associated with class c_i that have attribute value x_j;
p is a prior estimate for P(x_j | c_i); assuming a uniform prior, p = 1/k if the attribute X_j has k possible values;
m is a weight given to the prior: the n(c_i) actual observations are augmented by m additional virtual samples distributed according to p.
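A small numeric sketch of this m-estimate (the function name and the example numbers are ours, for illustration only):

```python
def m_estimate(n_ci_xj, n_ci, k, m):
    """P(x_j | c_i) = (n(c_i, x_j) + m*p) / (n(c_i) + m), with uniform prior p = 1/k."""
    p = 1.0 / k
    return (n_ci_xj + m * p) / (n_ci + m)

# An attribute value never observed with class c_i no longer gets probability 0:
print(m_estimate(0, 100, k=2, m=2))    # 1/102, instead of 0/100 = 0
```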
2. The limited precision of computer arithmetic: P(x_j | c_i) < 1 for every attribute value x_j and class c_i, so when the number of attribute values is very large, the product of the probabilities underflows toward zero. Solution: compute with the logarithm of the probability:

$$c^{*} = \arg\max_{c_i \in C} \left( \log P(c_i) + \sum_{j=1}^{n} \log P(x_j \mid c_i) \right)$$
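The underflow, and the effect of moving to logarithms, can be demonstrated with a short sketch (the numbers and the helper name are illustrative, not from our experiments):

```python
import math

# Multiplying many probabilities < 1 underflows standard floating point ...
product = 1.0
for _ in range(1000):
    product *= 0.01
print(product)                # 0.0: the true value 10**-2000 is not representable

# ... while the equivalent log-sum remains perfectly representable:
log_sum = 1000 * math.log(0.01)
print(log_sum)                # about -4605.17

def classify_log(priors, cond_probs):
    """c* = argmax_c [log P(c) + sum_j log P(x_j | c)]."""
    return max(priors, key=lambda c: math.log(priors[c])
               + sum(math.log(p) for p in cond_probs[c]))
```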
In the spam email classification problem, each sample that we consider is an email, and the set of classes an email can belong to is C = {spam, non-spam}.

When we receive an email, if we know nothing about it, it is hard to decide whether the email is spam or non-spam. The more characteristics or attributes of the email we know, the more we can improve the efficiency of the classification. An email has many characteristics, such as its title, its content, and whether or not it carries an attachment. A simple example: if we know that 95% of HTML emails are spam and we receive an HTML email, we can use this prior knowledge to compute the probability that the received email is spam. If this probability is greater than the probability of non-spam, we can conclude that the email is spam, although this conclusion is not fully reliable on its own; the more information we know, the greater the probability of a correct classification.

To obtain the prior probabilities, we use the Naïve Bayes method to train on the set of template emails, and then use these probabilities to classify a new email. The probability calculation is based on the Naïve Bayes formula. With the obtained probability values, we compare them with each other: if the spam probability is greater than the non-spam probability, we conclude that the email is spam; otherwise it is non-spam [5].
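The HTML example above can be worked through Bayes' theorem numerically. The two conditional likelihoods below are invented purely for illustration; only the prior P(spam) = 0.394 matches the sample dataset described in the next section.

```python
p_spam = 0.394                # prior P(spam), the spam proportion of the dataset
p_html_given_spam = 0.70      # assumed P(HTML | spam)
p_html_given_ham = 0.10       # assumed P(HTML | non-spam)

# Bayes' theorem: P(spam | HTML) = P(HTML | spam) P(spam) / P(HTML)
p_html = p_html_given_spam * p_spam + p_html_given_ham * (1 - p_spam)
p_spam_given_html = p_html_given_spam * p_spam / p_html
print(round(p_spam_given_html, 2))   # 0.82: greater than 0.5, so label the email spam
```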
EXPERIMENTAL RESULTS
We have implemented a test which applies the Naïve Bayes method to email classification. The total number of emails in the sample dataset is 4601, including 1813 spam emails (39.4%). This dataset, called Spambase, can be downloaded at http://archive.ics.uci.edu/ml/datasets/Spambase.

The dataset is divided into two disjoint subsets: the training set D_train (66.7%), used for training the system, and the test set D_test (33.3%), used for evaluating the trained system.
In order to evaluate a machine learning system's performance, we often use measures such as Precision (P), Recall (R), Accuracy rate (Acc), Error rate (Err), and the F1-measure. These measures are computed as follows:

$$P = \frac{n_{S \to S}}{n_{S \to S} + n_{N \to S}}$$

$$R = \frac{n_{S \to S}}{n_{S \to S} + n_{S \to N}}$$

$$Acc = \frac{n_{S \to S} + n_{N \to N}}{N_S + N_N}$$

$$Err = \frac{n_{S \to N} + n_{N \to S}}{N_S + N_N}$$

$$F_1 = \frac{2PR}{P + R}$$
where:
n_{S→S} is the number of spam emails which the filter recognizes as spam;
n_{S→N} is the number of spam emails which the filter recognizes as non-spam;
n_{N→S} is the number of non-spam emails which the filter recognizes as spam;
n_{N→N} is the number of non-spam emails which the filter recognizes as non-spam;
N_S is the total number of spam emails;
N_N is the total number of non-spam emails.
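These measures translate directly into code. As a check, the sketch below (a helper of our own) recomputes the F1-measure from the Experiment 1 counts reported in Table 1:

```python
def metrics(n_ss, n_sn, n_ns, n_nn):
    """Precision, Recall, Accuracy, Error, and F1 from the four confusion counts."""
    total = n_ss + n_sn + n_ns + n_nn           # N_S + N_N
    precision = n_ss / (n_ss + n_ns)
    recall = n_ss / (n_ss + n_sn)
    acc = (n_ss + n_nn) / total
    err = (n_sn + n_ns) / total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, acc, err, f1

# Experiment 1 counts from Table 1:
p, r, acc, err, f1 = metrics(486, 119, 204, 726)
print(round(100 * f1, 2))    # 75.06 (the paper reports 75.05%, the same value truncated)
```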
Experimental results on the Spambase dataset
We present test results for two ways of dividing the Spambase dataset:

Experiment 1: divide the original Spambase dataset with proportion k1 = 2/3, i.e., 2/3 of the dataset is used for training and the remainder for testing.

Experiment 2: divide the original Spambase dataset with proportion k2 = 9/10, i.e., 9/10 of the dataset is used for training and the remainder for testing.
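The split itself can be sketched as follows; the helper is hypothetical (the paper does not state whether the split was random), but the resulting subset sizes match the proportions above:

```python
import random

def split(dataset, train_fraction):
    """Divide a dataset into two disjoint subsets: training and test."""
    data = dataset[:]
    random.shuffle(data)
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]

# Experiment 1: k1 = 2/3 of the 4601 Spambase emails used for training
train_set, test_set = split(list(range(4601)), 2 / 3)
print(len(train_set), len(test_set))    # 3067 1534
```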
Table 1. Testing results

              Experiment 1   Experiment 2
n_{S→S}       486            180
n_{S→N}       119            2
n_{N→N}       726            276
n_{N→S}       204            3
F1-measure    75.05%         98.63%

The testing results in Experiment 2 have very high accuracy (approximately 99%).
CONCLUSION
In this paper, we have examined the effectiveness of the Naïve Bayes classification method. It is a classifier with a self-learning ability that improves classification performance, and it proved suitable for the email classification problem. Currently, we are continuing to build standard training samples and to adjust the Naïve Bayes algorithm to improve classification accuracy. In the near future, we will build a standard training and testing data system for both English and Vietnamese. This is a large problem and needs more concentrated effort.
REFERENCES
[1] Jonathan A. Zdziarski, Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification, No Starch Press, 2005.
[2] Mehran Sahami, Susan Dumais, David Heckerman, Eric Horvitz, "A Bayesian Approach to Filtering Junk E-Mail", 1998.
[3] Sun Microsystems, JavaMail API Design Specification, Version 1.4.
[4] T. M. Mitchell, Machine Learning, McGraw-Hill, 1997.
[5] Lê Nguyễn Bá Duy, Tìm hiểu các hướng tiếp cận phân loại email và xây dựng phần mềm, Tp. Hồ Chí Minh, 2005.
ABSTRACT
SPAM EMAIL FILTERING BASED ON MACHINE LEARNING
Trần Minh Đức *
University of Information and Communication Technology – Thai Nguyen University

In this paper, I introduce a spam email filtering method based on machine learning, because this approach is highly effective. With its learning ability (self-improving performance), the system can automatically learn and improve its spam email classification. At the same time, the classification ability of the system is continuously updated with new spam samples, so it is very difficult for spammers to defeat it, compared with other traditional approaches.
Keywords: machine learning, spam filtering, Naïve Bayes.

Received: 13/3/2014; Reviewed: 15/3/2014; Accepted for publication: 25/3/2014
Scientific reviewer: Dr. Trương Hà Hải – University of Information and Communication Technology – Thai Nguyen University