using artificial neural networks to identify image spam


USING ARTIFICIAL NEURAL NETWORKS TO IDENTIFY IMAGE SPAM

A Thesis Presented to The Graduate Faculty of The University of Akron

In Partial Fulfillment

of the Requirements for the Degree

Master of Science

Priscilla Hope

August, 2008

USING ARTIFICIAL NEURAL NETWORKS TO IDENTIFY IMAGE SPAM

Priscilla Hope

Thesis

Dr. Kathy J. Liszka

Dr. Ronald F. Levant

Dr. Timothy W. O'Neil

Dr. George R. Newkome

ABSTRACT

Internet technology has made international communication easy and convenient. This convenience has led many people to rely on electronic mail in almost all spheres of life, personal and business alike. Unscrupulous organizations and individuals have taken undue advantage of this convenience and populate users’ inboxes with unwanted messages, making email spam a menace. Even as anti-spam software producers think they have almost solved the problem, spammers come out with new techniques. One such tactic in the spammers’ toolbox comes in the form of image spam: messages that contain little more than a link to an image rendered in an HTML mail reader. The image typically contains the spam message one hopes to avoid, yet it is able to bypass most filters due to the composition and format of these pictures.

This research focuses on identifying these images as spam by using an artificial neural network (ANN), a software program used for recognizing patterns, modeled on the biological neural networks in our brains. As information propagates through a neural network, it “learns” about the data. A large collection of both spam and non-spam images has been used to train an ANN, and the effectiveness of the trained network has then been tested against sets of both unidentified and already identified pictures. This process involves formatting images and adding the desired training values expected by the ANN. Several different ANNs have been trained using different configurations of hidden layers and nodes per layer. A detailed process for preprocessing spam image files is given, followed by a description of how to train an artificial neural network to distinguish between ham and spam. Finally, the trained network is tested against both known and unknown images.

ACKNOWLEDGEMENTS

This research would not have been possible without Jason Bowling making his ideas available for further study. I’m grateful to him for his generosity. I also appreciate Garth Bruen of Knujon for contributing spam images, without which my corpus would have been small. I appreciate my committee members, Dr. Tim O’Neil and Dr. Margush, for their insightful corrections. My sincerest gratitude goes to Dr. Kathy J. Liszka, my supervisor, with whose help this research became a joy to work on. Thanks, Dr. Liszka, you are the best supervisor!

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

CHAPTER

I INTRODUCTION

II THE NATURE OF SPAM
2.1 Basic Definitions
2.2 History and Statistics
2.3 The Long Arm of Spam
2.4 Spam Filters
2.5 Who Are Spammers and Why Can’t We Stop Them
2.6 The Cost of Spam
2.7 Why Are We Reading Spam
2.8 Getting Past the Spam Filter
2.9 Related Research

III IMAGES AND THE CORPUS
3.1 Image Spam Creating Techniques
3.2 Image Formats
3.3 Image Preparation
3.4 Corpus

IV THE ARTIFICIAL NEURAL NETWORK
4.1 Fast Artificial Neural Network (FANN)
4.2 Creating the Artificial Neural Network
4.3 Training the Artificial Neural Network
4.4 Testing the Artificial Neural Network

V TRAINING RESULTS
5.1 Training files
5.2 Test Results
5.3 Sample Runs

VI CONCLUSION AND FUTURE WORK

REFERENCES

APPENDIX

LIST OF TABLES

Table

3.1 Corpus Statistics
5.1 Training Image Times for 50 Hidden Neurons
5.2 Training Image Times for 75 Hidden Neurons

LIST OF FIGURES

Figure

1.1 Image Spam Examples
2.1 The First Generally Acknowledged Email Spam
2.2 Sample Text-based Spam Message
3.1 Text-only image
3.2 Assembled Images
3.3 Original six individual images
3.4 Script for Checksum
3.5 Unix Script for Reformatting File Names
4.1 Perceptron or feed-forward ANN
4.2 Script automating executing image2fann utility
4.3 Sample content of a file containing a set of image files to be run through image2fann utility
4.4 Sample preprocessed images to be trained
4.5 ANN for Spam Image Identification
4.6 Sample partial output from train.c
4.7 Process flow of ANN training and testing
4.8 Process of testing a network
5.1 Sample preprocessed images to be tested
5.2 Sample output file from train.c (partial)
5.3 Sample output file from test.c (partial)
5.4 ANN of 572 trained images using 50 hidden neurons and tested with 53 untrained images
5.5 ANN of 572 trained images using 75 hidden neurons and tested with 2000 trained images
5.9 ANN of 2000 trained images using 75 hidden neurons and tested with 100 untrained images
5.10 ANN of 2000 trained images using 50 hidden neurons and tested with 2000 trained images
5.11 ANN of 2000 trained images using 50 hidden neurons and tested with 100 untrained images
5.12 ANN of 2000 trained “images with mostly words” using 75 hidden neurons and tested with 100 trained images
5.13 ANN of 2000 trained “images with mostly words” using 75 hidden neurons and tested with 100 untrained images
5.14 ANN of 2000 trained “images with mostly words” using 50 hidden neurons and tested with 100 trained images
5.15 ANN of 2000 trained “images with mostly words” using 50 hidden neurons and tested with 100 untrained images
5.16 Jason Bowling on a hiking trip

CHAPTER I INTRODUCTION

Select – delete – repeat. It’s what most email users spend the first ten minutes of every day doing: purging spam from their inboxes. It has become as popular in casual conversation as the weather. Clearly spam is not going away, at least not in the foreseeable future. People still respond to it, buy products from it, and are scammed by it. Communication on the Internet has been eagerly encouraged due to its ease of use, the opportunities it offers to develop personal and professional contacts with colleagues around the world that previously would have been difficult or impossible, and the possibility of broadcasting questions, discussion topics, opinions, documents, and more to thousands of colleagues around the world virtually simultaneously. Communication on the Internet comes in many forms: email, discussion groups, Usenet news, chat groups, IRC (Internet Relay Chat), video and audio conferencing, and Internet telephony / SMS (short message service).

People’s response to communicating on the Internet was great, since they wanted to share ideas in an inviting, trusting atmosphere. Unfortunately, the boom in cyberspace population came along with other social vices, as happens in every growing human society. Large chunks of the Internet took shape as a marketplace, a soapbox, and mischievous, even hostile playgrounds, resulting in a less trusting atmosphere. Suspicion has overridden trust, resulting in some users wanting to shut down communication rather than open it up.

Figure 1.1 Image Spam Examples

Many factors contribute to this hostile environment, with one playing a major role in the slow erosion of the old, idealistic Internet philosophy. This factor has led people to regard the Internet with mistrust, as a cyberspace full of pornography and of people trying to hijack bank accounts, steal identities, and otherwise manipulate, deceive, and trick them. It's a nuisance that every Internet user deals with every day. It is called spam!

Filters are available to combat these unsolicited nuisances, but spammers continually develop new techniques to avoid detection by filters. This thesis focuses on one specific category of unsolicited bulk email: image spam. It is a fairly recent phenomenon that has appeared in the past few years. In 2005, it comprised roughly 4.8% of all emails, then grew to an estimated 25% by mid-2006 [1]. Image spam reached its peak in January 2007, accounting for 52% of all email spam [2]. These messages come as image attachments that contain text, with what looks like a legitimate subject and from address. They are successfully getting by traditional spam filters and optical character recognition (OCR) systems. As a result, they are often referred to as OCR-evading spam images. Two common examples are shown in Figure 1.1. They come in many forms: different file types, multipart images in which one image is split into several, and images rotated by a slight degree.

This research examines a method for identifying image spam by training an artificial neural network. Chapter two presents an overall view of the spam problem and a brief summary of current research. A detailed process for preprocessing spam image files is given in chapter three, along with a discussion of how the corpus was developed. A description of artificial neural networks is given in chapter four, with instructions on how to train the network to distinguish between ham and spam. Chapter five presents results derived from creating and testing the trained network against unknown and known images. Finally, conclusions and future work are discussed in chapter six.

CHAPTER II THE NATURE OF SPAM

Many believe spam is an acronym for "sales promotional advertising mail" or "simultaneously posted advertising message". Other acronyms associated with spam include UBE (Unsolicited Bulk Email), MMF (Make Money Fast) and MLM (Multi-Level Marketing). There seem to be two popular theories of why the name spam is associated with Unsolicited Commercial Email (UCE). Most seem to associate spam with the brand name product (SPAM, from "Shoulder Pork and hAM" or "SPiced hAM") marketed by Hormel. SPAM luncheon meat is a canned precooked meat product made by the Hormel Foods Corporation [3]. Email spam, like its lunchmeat namesake, is something no one asks for or wants. Those who do happen to get it most likely throw it away. Another group holds that the name spam was borrowed from a British television series, Monty Python’s Flying Circus, in which actors sang a song entitled “Spam” with the word “spam” repeated over and over, drowning out all other sounds [7].

Whichever theory one goes with, there is some truth underneath all the silliness. Spam obscures legitimate business and personal correspondence that we want to read. Worse, some of it is downright unpleasant, even offensive to view once opened.

2.1 Basic Definitions

Email spam comes with a variety of definitions including:

• unsolicited e-mail on the Internet [4];

• the abuse of electronic messaging systems to indiscriminately send unsolicited bulk messages [5]; and

• unsolicited e-mail, often of a commercial nature, sent indiscriminately to multiple mailing lists, individuals, or newsgroups; junk e-mail [6]

Some examples of spam include:

• unsolicited communication, including "pop-ups" and "pop-unders";

• irrelevant, inappropriate, or repetitious e-mail or message board post;

• advertisement for some product or service;

• commercial, political and social commentaries sent indiscriminately to many recipients; and

• email chain letters

2.2 History and Statistics

May 3, 2008 marked the 30th anniversary of email spam. The first email spam was recorded on May 3, 1978, sent by a Digital Equipment Corporation marketing representative, Gary Thuerk. He sent this email, shown in Figure 2.1, to all Arpanet addresses on the west coast [7] [8]. Technically, by definition, the very first spam was recorded in a telegram on September 13, 1904 [5]. Undoubtedly, this is not what Joseph Henry and Samuel Morse had imagined!

Mail-from: DEC-MARLBORO rcvd at 3-May-78 0955-PDT

Date: 1 May 1978 1233-EDT

From: THUERK at DEC-MARLBORO

Subject: ADRIAN@SRI-KL

DIGITAL WILL BE GIVING A PRODUCT PRESENTATION OF THE NEWEST MEMBERS OF THE DECSYSTEM-20 FAMILY; THE DECSYSTEM-2020, 2020T, 2060, AND 2060T. THE DECSYSTEM-20 FAMILY OF COMPUTERS HAS EVOLVED FROM THE TENEX OPERATING SYSTEM AND THE DECSYSTEM-10 COMPUTER ARCHITECTURE. BOTH THE DECSYSTEM-2060T AND 2020T OFFER FULL ARPANET SUPPORT UNDER THE TOPS-20 OPERATING SYSTEM. THE DECSYSTEM-2060 IS AN UPWARD EXTENSION OF THE CURRENT DECSYSTEM 2040 AND 2050 FAMILY. THE DECSYSTEM-2020 IS A NEW LOW END MEMBER OF THE DECSYSTEM-20 FAMILY AND FULLY SOFTWARE COMPATIBLE WITH ALL OF THE OTHER DECSYSTEM-20 MODELS.

WE INVITE YOU TO COME SEE THE 2020 AND HEAR ABOUT THE DECSYSTEM-20 FAMILY AT THE TWO PRODUCT PRESENTATIONS WE WILL BE GIVING IN CALIFORNIA THIS MONTH. THE LOCATIONS WILL BE:

(4 MILES SOUTH OF S.F. AIRPORT AT BAYSHORE, RT 101 AND RT 92)

A 2020 WILL BE THERE FOR YOU TO VIEW. ALSO TERMINALS ON-LINE TO OTHER DECSYSTEM-20 SYSTEMS THROUGH THE ARPANET. IF YOU ARE UNABLE TO ATTEND, PLEASE FEEL FREE TO CONTACT THE NEAREST DEC OFFICE FOR MORE INFORMATION ABOUT THE EXCITING DECSYSTEM-20 FAMILY.

Figure 2.1 The First Generally Acknowledged Email Spam

Email spam with a specific commercial bent started with what has now been dubbed the “Green Card” spam. This anti-historical event took place on March 15, 1994 [5], when two attorneys, Laurence Canter and Martha Siegel, put together a bulk Usenet posting to advertise their services in helping immigrants obtain visas of the kind then known as “green cards”. Savvy entrepreneurs caught on quickly, and spam has been growing since with no end in sight.

The following figures give an estimate of absolute numbers through the years:

• 1978 – message sent to 600 addresses [7]

• 1994 – spam sent to 6000 newsgroups, getting to millions of people [9]

• 2005 – (June) 30 billion spam per day [10]

• 2006 – (June) 55 billion spam per day [10]

• 2006 – (December) 85 billion spam per day [5]

• 2007 – (February) 90 billion spam per day [5]

• 2007 – (June) 100 billion spam per day [11]

• 2007 – (November) between 65% and 85% of email is spam [12]

• 2008 – (July) 87.56% of email is spam with image spam being 12.87% [35]

2.3 The Long Arm of Spam

Virtually no electronic form of communication is safe from this nuisance. Although we normally associate spam with email, it comes in many forms, including instant messaging (spim), chat rooms, newsgroups, mobile phones, online game messaging, search engine spam (spamdexing), blogs, wikis and guest books, and video sharing sites. In short, where there is technology, spam is sure to follow.

There are two basic forms of spam received in emails:

• Text-based – an email consisting of text only to convey the sender’s information, as shown in Figure 2.2

• Image-based – the spammer’s message is sent in the form of a graphic or an image, as shown in Figure 1.1

free casino games, bingo cards, free Poker chips, Sportsbook, no deposit

online casino where you find more games, more winners, more often!

www.casino-vip1.com?” – Sender: Julia Broussard gplczvbf@ymail.yu.edu,

Subject: Native American casino

Figure 2.2 Sample text-based spam message

Spam with an attached image is a relatively new phenomenon, which only started to appear in numbers in the second half of 2005. Image spam exploded in mid-2006 and by year’s end, over 50% of total spam received was image spam. It has since declined and now accounts for around 20% [13]. According to the paper Image Spam – the New Face of Email Threat, image spam forms around 12.87% of total email spam, which in turn forms around 87.56% of all email [35].
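As a quick check on how the two percentages reported in [35] combine, the short sketch below multiplies them to estimate image spam as a share of all email. The variable names are illustrative only, and the result is simple arithmetic on the quoted figures, not a number taken from [35].

```python
# Shares as quoted from [35] (July 2008).
spam_share_of_email = 0.8756   # spam as a fraction of all email
image_share_of_spam = 0.1287   # image spam as a fraction of all spam

# Image spam as a fraction of all email is the product of the two shares.
image_share_of_email = spam_share_of_email * image_share_of_spam
print(f"{image_share_of_email:.2%}")  # roughly 11.27% of all email
```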

Popular Internet browsers, such as Firefox and Internet Explorer, coupled with powerful search engines like Google, have changed our lives, as we search for answers to movie trivia and place bids on eBay. Unfortunately, there seems to be a correlation between time spent on the Internet and the amount of spam received. Spammers obtain our email addresses in a number of common ways. Some of these include:

• Scanning Usenet for email addresses

• Using programs to extract email addresses from article headers

• Harvesting subscribers to mailing lists from servers

• Address harvesting programs (bots or spiders) crawling through the Internet looking for email addresses in web pages. An especially good place to look is in HTML mailto: links

• People finder sites also contribute to the spammers’ email lists. For example, Microsoft Network’s Hotmail automatically adds new email addresses to some white page directories

• Email addresses can be bought, usually cheaply, from others who have compiled the addresses either legitimately or not

• The three major domain contact points found by “whois-style” searching usually provide email contacts for administrative, technical and billing staff

• A dictionary search of email servers of large email hosting companies is done by picking known domain names (e.g. computer.org), sending emails to guessed prefixes, and seeing what doesn’t come back

• Chat rooms also serve as a hotbed for email harvesting

2.4 Spam Filters

Spammers have two stages of spamming: landing the mail in a user’s inbox and enticing the unsuspecting user to read it. Similarly, users have two defensive tactics. They can take on the arduous task of weeding out the unwanted spam by hand, or they can use software filters. Spam filters prevent spam from reaching an inbox. Manual weeding may still need to be done for spam that successfully bypasses the filter. Manual weeding is also necessary to identify and retrieve legitimate messages that have been marked as spam. Unfortunately, it is left to the user to identify the spam and manually delete it, or to report to the spam filter that a message has succeeded in circumventing it. This is where the spammers’ trickery comes into play. Some users try to save labor and time by identifying and deleting spam based on the sender’s address or subject matter. There are myriad spam filters available on the market, some free with limited capabilities, and others ranging in price. Examples include McAfee, AVG, SurfControl, Symantec, and TrustedSource™. Among the capabilities of many email filters are:

• Compare incoming email with a classified database of spam and junk email content

• Use fifteen pre-defined content dictionaries classified by categories such as “Spam”, “Adult” or “Hate Speech”

• Use context sensitive language analysis

• Can be dynamically trained to recognize and understand an organization’s specific proprietary content

• Strip HTML out of the message

• Verify the existence of an email sender by using reverse client DNS lookup

• Allow users to create their own “explicit deny list”

• Message reputation and fingerprinting, checking to see if the email content has elements of spam that have been seen before

• Image fingerprinting, which checks images to see if they contain similarities to cataloged spam images

• Image property space, a technique that uses rules to extract properties of images in an email that might be spam

• Analyze the format, layout and structure of an email. Most spammers use the

• Image hashing, a technique that creates a digital signature of the actual image. This is effective when the same image is sent for several days in a row, as is common with some spammers. Otherwise, it doesn’t help
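To illustrate why an exact image hash only catches repeated images, the sketch below hashes a small synthetic grayscale pixel buffer and then changes a single pixel. The pixel data, the helper name image_signature, and the choice of SHA-256 are assumptions made for this example; the text does not say which hash function real filters use.

```python
import hashlib

def image_signature(pixels: bytes) -> str:
    """Return a hex digest used as an exact-match signature (hypothetical helper)."""
    return hashlib.sha256(pixels).hexdigest()

# A synthetic 4 x 4 grayscale "image": 16 pixel values in the range 0-255.
original = bytes([200] * 16)

# The same image with one pixel changed by a single gray level.
altered = bytearray(original)
altered[5] = 201
altered = bytes(altered)

# An identical resend produces the same signature; the altered copy does not.
same_again = image_signature(original) == image_signature(bytes([200] * 16))
differs = image_signature(original) != image_signature(altered)
print(same_again, differs)
```

Under these assumptions, a resend of the identical image matches the stored signature, while a change to one pixel yields an unrelated digest, which is consistent with the remark that hashing helps only when the same image is sent repeatedly.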

2.5 Who are Spammers and Why Can’t We Stop Them?

There are professional spammers who are in the business for many reasons, ranging from advertisement to malware propagation. This does not take into account the junk email, in the form of jokes, from family, friends and colleagues at work. Surprisingly, or maybe not, there are actually a few anti-spam software companies that use spam to advertise their latest anti-spam products. Mainsleaze is the name given to spam from what are generally considered reputable companies that have real products, as opposed to fly-by-night outfits promising a happier life with Viagra. One example of mainsleaze is Kraft Foods’ spam marketing blitz for its Gevalia gourmet coffee products [5]. Another example is IDate’s email harvesting of subscribers to the popular social networking site Quechup [14].

According to new research, it appears that just 10 domain name registrars account for more than 75% of all Web sites advertised through spam [15]. These include:

• Xinnet Bei Gong Da Software

• BEIJING Networks

• Todynamic

• Joker

• eNom, Inc.

2.6 The Cost of Spam

Unlike postal junk mail, which requires paper, an envelope, and a stamp, email spam carries relatively small operating costs for the spammer. Their major expenditure and effort lies in managing mailing lists, which makes a massive campaign economically viable. The public and ISPs, on the other hand, bear the cost of lower productivity, reduced bandwidth, and potential fraud. The direct costs include consumption of network resources, disk space, the cost to purchase and maintain commercial spam filters, and, of course, the human time and attention needed to dismiss those unwanted messages.

Of a more serious nature, a whole new breed of crimes has evolved as a result of spam. Our increasing digital presence has made us vulnerable to financial, identity, data and intellectual property theft. Virus and other malware infections damage or destroy our data, resulting in harm ranging from lost productivity to major business losses. To an organization, spam is not only a nuisance, it is expensive.

2.7 Why Are We Reading Spam?

We all claim that we delete our spam, but if that were true, spammers would have no reason to continue pushing it through the pipe to us. Obviously, enough people are reading enough spam to make it lucrative for spammers to continue. The strategy adopted by the spammer consists of finding victim addresses, landing the mail in users’ inboxes and enticing them to open it. A few of the tactics include [17]:

• Cultural preoccupation: These types of messages tap into society’s cultural obsession with comparison, better known as “keeping up with the Joneses”. “More” or “less” seems to be on everyone’s mind. Spammers capitalize on this, sending messages with subjects implying one or the other, with the sender appearing to be a legitimate, well-known online service.

• Typical concerns: Money, success, weight loss, and freedom from depression are basic human needs which the spammer can play on to entice an email user to read spam. They promise the golden elixir for our insecurities and deepest desires.

• Sexuality: Most women want to be attractive to men, and vice versa, in whatever form that takes. If you have read email in the last week, you’ve been subjected to myriad ads of this type. To some users it’s an offensive eyesore, while others welcome it and take the bait.

• Anxiety: Messages in this category induce anxiety in users, thereby making it difficult for them to ignore the message. The header often has nothing to do with the content presented in the email. The purpose is simply to get the user to open the email. An example would be something like “Your PayPal account has been compromised.” In some cases, the email may be an actual PayPal phishing attempt, but often the user opens it to find a link to a pharmaceutical site or other scam.

• Faking a personal touch: When these first started arriving, it was easy to fall prey to emails with one’s name in the subject and a sender with a real name instead of an email address. This is still an effective tactic to draw in the naïve user.

• Faking replies and interactions: Usually a message prefixed with “re:” makes users think it might be a reply to a message they sent or a reply to a message in an email group they might belong to. Sneaky, but effective.

• Faking informality and acquaintance: The idea here is to create subjects worded in a casual, friendly style. This gives the impression that the message is coming from someone familiar to the user. An example might be a subject line of “How’s it

2.8 Getting Past the Spam Filter

There are many techniques employed by email spammers to bypass spam filtering engines. According to John Graham-Cumming [18], spamming techniques fall into several categories. Although he has listed seven general tactics, the most common seem to target traditional Bayesian filters. The three techniques that are relevant to this thesis are:

• Bad Word Obfuscation: Filters typically scan a message looking for certain questionable words and phrases (Viagra, weight loss, phentermine, bad credit, etc.). If these words and phrases can somehow be masked so that the filter doesn’t identify them, the email has a much better chance of slipping through. Intentional misspelling of these words is a simplistic but effective approach. HTML tags, used creatively, can show the message to the user in readable form while disguising it from the filter.

• Good Word Insertion: These emails were originally very puzzling to the uninformed user. In addition to the annoying commercial message, they have additional “harmless” words embedded in some fashion, or simply appended outright to the end of the message, that read like nonsensical rambling. The intent is clearly to confuse a statistical filter.

• Token Avoidance: Rather than worry about disguising “bad” words, or tipping the scale with “good” words, these techniques attempt to prevent a filter from tokenizing a message in the first place. This is where image spam enters the

2.9 Related Research

Learning Fast Classifiers for Image Spam [19] is one such attempt. The authors assert that their method exceeds 90% accuracy. They classify spam and ham images by focusing on simple properties of the image, together with just-in-time (JIT) feature extraction, which adds to the feature classification dataset as needed by the classifier. The features of images used for basic classification are file format, file size, image metadata, and image size. Other image features they considered are average color, color saturation, edge detection, prevalent color coverage and a random pixel test. Their feature set evaluation was accomplished using a number of learning models: Maximum Entropy, naïve Bayes, and an ID3 decision tree. All three algorithms were implemented in Mallet, using the Java 1.5 imageio library for image processing and feature extraction. For “feature predictiveness”, they computed the mutual information of each feature with the target label, spam or ham [19].
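The mutual information between a feature and the spam/ham label can be estimated from co-occurrence counts. The sketch below does this for a single binary feature. The function name and the toy data are assumptions made for illustration and are not values or code from [19].

```python
from collections import Counter
from math import log2

def mutual_information(feature, label):
    """Estimate I(X; Y) in bits from two equal-length sequences of discrete values."""
    n = len(feature)
    joint = Counter(zip(feature, label))
    px = Counter(feature)
    py = Counter(label)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Toy data (made up): 1 means the image is a GIF, 0 means it is not.
is_gif = [1, 1, 1, 1, 0, 0, 0, 0]
label_perfect = ["spam"] * 4 + ["ham"] * 4   # the feature determines the label
label_unrelated = ["spam", "ham"] * 4        # the feature says nothing about the label

print(mutual_information(is_gif, label_perfect))    # one full bit of information
print(mutual_information(is_gif, label_unrelated))  # no information
```

A feature with higher mutual information is more predictive of the label, so a ranking of features by this score is one plausible reading of “feature predictiveness”.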


Filtering Image Spam with Near-Duplicate Detection [20] is another attempt at filtering spam images. In that research, the authors use a “content-based image similarity searching” technique to classify spam images. Their detection system maintains a false positive rate of 0.001% [20].

Another approach [21] is mostly based on extracting text regions inside the images of interest and then using a Support Vector Machine (SVM) to distinguish between ham and spam images. An SVM is a linear classifier that separates data points into groups using hyperplanes [22]. The authors claim an 85% correct detection rate with about a 15% false positive rate.
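As a rough illustration of classifying with a hyperplane, the sketch below applies a fixed linear decision rule, the sign of w·x + b, to two-dimensional points. The weights, bias, and feature meanings are invented for the example. A real SVM would learn w and b from training data, which this sketch does not attempt, and nothing here reproduces the method of [21].

```python
def classify(point, weights, bias):
    """Label a point by which side of the hyperplane w.x + b = 0 it falls on."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return "spam" if score > 0 else "ham"

# Hypothetical hyperplane in a 2-D feature space:
# (fraction of image area covered by text, color saturation).
weights = [2.0, -1.0]
bias = -0.5

print(classify([0.9, 0.2], weights, bias))  # score is about 1.1, positive side
print(classify([0.1, 0.6], weights, bias))  # score is about -0.9, negative side
```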

Finally, research has been conducted to identify a number of useful visual features, including banner images, computer generated graphics and embedded text. These features are then combined with message text features. The authors also use an SVM to train their classifier to distinguish between ham and spam. This approach achieved about an 81% detection rate with about a 1% false positive rate on their corpus [23] [24]. ANNs have been used in [25] to identify spam by looking at the text-based header portion of spam email. PUREmail is a second generation email filter that uses artificial intelligence to process images by visualization [35].

Although the results given are encouraging, this thesis uses an artificial neural network to classify spam and ham. Artificial neural networks have the capability to mimic human intelligence when trained with good input data. This is the first research of this kind to be conducted for image spam [26]. In the next chapter, the use of images in spam and the manipulation of the images for this thesis research are addressed.

CHAPTER III IMAGES AND THE CORPUS

From mere observation, one can conjecture that image spam is yet another clever way to avoid anti-spam devices. Spammers have resorted to this format because it offers advantages such as the following:

• Their message bypasses anti-spam techniques that only scan the message body for spam-like text

• Pretty graphics allow for a colorful and “professional looking” message, drawing more attention to their email

• It opens up a world of new techniques for spammers to randomize each message, again bypassing anti-spam techniques based on signatures of the message or image

• They have been fairly successful at defeating techniques based on Optical Character Recognition (OCR), a technology that extracts text from images

3.1 Image Spam Creating Techniques

The Spammer’s Compendium [18] provides a compilation of spammer techniques collected and classified by a community of volunteers. The site, maintained by John Graham-Cumming, is used as an educational resource as well as for research. It groups spam messages based on a general classification of the techniques used to avoid the filtering process. These include things like adding “good” words to skew a filter that looks at the percentage of “bad” words (e.g. Viagra). Another classification includes those spam messages that make tokenizing a message difficult, so that the words in the message don’t even make it to “bad word” analysis. Among the classes described at the Spammer’s Compendium are [18]:

• Text-only images: An example is shown in Figure 3.1. To a user, it appears to be a normal text email, but on critical examination, one realizes that it’s an image.

• Sliced Images

Other image spam contains multiple images joined like a jigsaw puzzle [27]. Figure 3.2 shows an assembled image, and Figure 3.3 shows the individual image pieces that make up what the user is intended to see while fooling the spam filter. This image was taken “from the wild” from the Knujon data set collected on April 22, 2008.

• Randomization techniques

A spammer alters individual pixels in the image so that the result is difficult to distinguish from the original by eye. Because the pixels are randomized, each iteration of the image appears completely different to many image spam filters.

• Wild backgrounds

Image spammers are now using highly colored and patterned backgrounds, uneven letters, and randomly inserted pixels around the border. Each image is unique and hard to read for any software attempting to use Optical Character Recognition, a technology that aims to scan an image and extract text, and which requires known fonts to be effective.

Figure 3.1 Text-only image

• Multi-frame animated images

Spammers send an animated gif image containing multiple frames with their message. The frames rotate at so fast a rate that the human eye cannot detect the animation and sees only the final result.

Figure 3.2 Assembled Images

Figure 3.3 Original six individual images

• Stock Splits

This new trick, associated with a certain type of stock spam, appeared in June 2006. The spammer splits the original image into multiple images, again to avoid signature filters and possibly image scanning software. The email client reassembles the image when the message is opened.
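The splitting trick can be sketched with a plain grid of pixel values: cut the image into vertical strips, as a spammer might, and join the strips side by side, as the mail client effectively does when it renders the message. The function names and the tiny 2 × 6 image are assumptions made for illustration, and the sketch assumes the width divides evenly by the number of strips.

```python
def split_columns(image, parts):
    """Cut a 2-D list of pixels into `parts` vertical strips of equal width.

    Assumes the image width is evenly divisible by `parts`.
    """
    width = len(image[0])
    step = width // parts
    return [[row[i:i + step] for row in image] for i in range(0, width, step)]

def reassemble(strips):
    """Join vertical strips side by side to rebuild the full image."""
    rows = len(strips[0])
    return [sum((strip[r] for strip in strips), []) for r in range(rows)]

# A made-up 2 x 6 grayscale image.
image = [
    [10, 20, 30, 40, 50, 60],
    [11, 21, 31, 41, 51, 61],
]

strips = split_columns(image, 3)
print(strips[0])                    # first strip: [[10, 20], [11, 21]]
print(reassemble(strips) == image)  # the pieces rebuild the original
```

Each strip on its own carries only part of the message, which is why a signature computed over any single piece need not match a signature of the whole image.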


3.2 Image Formats

Image formats provide a standard method of organizing and storing image data. There are many formats, but our research focused on bmp, gif, png and jpeg. These are the only formats we encountered in our quest to create a decent sized corpus.

BMP, the bitmap or DIB (device-independent bitmap) format, is used to store bitmap digital images, specifically on Microsoft systems. BMP files are usually large. GIF stands for Graphics Interchange Format and is used to store images limited to an 8-bit palette of 256 colors. PNG (Portable Network Graphics) was the open source successor to GIF and supports more colors, up to 16 million. JPEG (Joint Photographic Experts Group) supports 8 bits per color channel for a 24-bit total and produces relatively small file sizes.
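The color counts quoted for these formats follow directly from the number of bits available per pixel, as the short calculation below shows. The helper name is illustrative.

```python
def color_count(bits_per_pixel: int) -> int:
    """Number of distinct colors representable with the given bit depth."""
    return 2 ** bits_per_pixel

print(color_count(8))      # an 8-bit palette: 256 colors
print(color_count(3 * 8))  # 8 bits for each of three channels: 16777216 colors
```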

All of the images in our experiment are compressed to jpeg due to its advantage of having a smaller file size. The degree of compression can be adjusted to trade image quality against storage size. Other reasons for choosing the jpeg format include the fact that almost all digital cameras and other photographic image capture devices offer it. It is also the common format for transmitting images on the World Wide Web [28].

3.3 Image Preparation

The program used in this research, image2fann, is written in the C language. It takes images in almost any common format and uses ImageMagick [29], an open source utility that converts images from virtually any format to another. In this case, gif, bmp and png files are converted to jpg and scaled down to a standard size for processing. The same program also sets up the rest of the format for the ANN input file. The program handles multipart images, creating separate lines of data for each frame, as well as single part images. The process works as follows:

• Use the ImageMagick utility to make a rescaled 150 × 150 pixel jpg image from the source input file

• Use the ImageMagick utility a second time to convert the 150 × 150 pixel jpg image into a 150 × 150 pixel 8-bit grayscale image

• At this point, the image consists of 150 × 150 = 22,500 values that range from 0 –
