

The Automated Traffic Handbook

Managing Spiders, Bots, Scrapers, and Other Non-Human Traffic

Andy Still


How It Works

Designed by web performance experts, TrafficDefender is a cloud service that sits in front of your website or API, controlling the flow of traffic to it. Our highly resilient platform guarantees uptime and protects your website from malicious bot activity, enabling you to generate maximum revenue over your site's busiest periods.

Guarantee website uptime, protect your business from malicious bots, ensure excellent customer experience and maximise revenue generated by web applications.

The Industry Leading Web Traffic Management System

Learn more at intechnica.com/trafficdefender


Beijing  Boston  Farnham  Sebastopol  Tokyo



The Automated Traffic Handbook

by Andy Still

Copyright © 2018 O’Reilly Media, Inc. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Virginia Wilson

Production Editor: Nicholas Adams

Copyeditor: Jasmine Kwityn

Interior Designer: David Futato

Cover Designer: Randy Comer

Illustrator: Rebecca Demarest

Tech Reviewers: Daniel Huddart, Andy Lole, and Jason Hand

February 2018: First Edition

Revision History for the First Edition

2018-02-02: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. The Automated Traffic Handbook, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. This work is part of a collaboration between O’Reilly and Intechnica. See our statement of editorial independence.


Table of Contents

Introduction

Part I. Background

1. What Is Automated Traffic?
   Key Characteristics of Automated Traffic
   Exclusions

2. Misconceptions of Automated Traffic
   Misconception: Bots Are Just Simple Automated Scripts
   Misconception: Bots Are Just a Security Problem
   Misconception: Bot Operators Are Just Individual Hackers
   Misconception: Only the Big Boys Need to Worry About Bots
   Misconception: I Have a WAF, I Don’t Need to Worry About Bot Activity

3. Impact of Automated Traffic
   Company Interests
   Other Users
   System Security
   Infrastructure

Part II. Types of Automated Traffic

4. Malicious Bots
   Application DDoS

5. Data Harvesting
   Search Engine Spiders
   Content Theft
   Price Scraping
   Content/Price Aggregation
   Affiliates
   User Data Harvesting

6. Checkout Abuse
   Scalpers
   Spinners
   Inventory Exhaustion
   Snipers
   Discount Abuse

7. Credit Card Fraud
   Card Validation
   Card Cracking
   Card Fraud

8. User-Generated Content (UGC) Abuse
   Content Spammer

9. Account Takeover
   Credential Stuffing/Credential Cracking
   Account Creation
   Bonus Abuse

10. Ad Fraud
    Background to Internet Advertising
    Banner Fraud
    Click Fraud
    CPA Fraud
    Cookie Stuffing
    Affiliate Fraud
    Arbitrage Fraud

11. Monitors
    Availability
    Performance
    Other

12. Human-Triggered Automated Traffic

Part III. How to Effectively Handle Automated Traffic in Your Business

13. Identifying Automated Traffic
    Indications of an Automated Traffic Problem
    Challenges
    Generation 0: Genesis—robots.txt
    Generation 1: Simple Blocking—Blacklisting and Whitelisting
    Generation 2: Early Bot Identification—Symptom Monitoring
    Generation 3: Improved Bot Identification—Real User Validation
    Generation 4: Sophisticated Bot Identification—Behavioral Analysis

14. Managing Automated Traffic
    Blocking
    Validation Requests
    Alternative Servers/Caching
    Alternative Content

Conclusion


Web traffic consists of more than just the human users who visit your site. In fact, recent reports show that human users are becoming a minority. The rest belongs to an ever-expanding group of traffic that can be grouped under the heading automated traffic.

Terminology

The terms automated traffic, bot traffic, and non-human traffic are equally common and are used interchangeably throughout this book.

As long ago as 2014, Incapsula estimated that human traffic accounted for as little as 39.5% of all traffic they saw. This trend is predicted to continue, with Cisco estimating that automated traffic will grow by 37% year on year until 2022.

However, this is not simply a growth in the quantity of automated traffic but also in the variety and sophistication of that traffic. New paradigms for interaction with the internet, more complex business models and interdependence between sources of data, evolution of shopping methods and habits, increased sophistication of criminal activity, and the availability of cloud-based computing capacity are all converging to create an automated traffic environment that is ever more challenging for a website owner to control.



It’s Not All Good or Bad

It is simplistic to think of automated traffic as being all goodies and baddies. The truth is much more nuanced than that. As we'll discuss, there are clear areas of good and bad traffic, but there is a gray area in between where you will need to assess the positivity or negativity for your situation.

This growth poses a number of fundamental questions for anyone with responsibility for maintaining the efficient operation or maximum profitability of a public-facing website:

• How much automated traffic is hitting my website?

• What is this traffic up to?

• How worried should I be about it?

• What can I do about it?

The rest of this book will help you understand how you can provide answers to these questions.

Terminology

The challenge of automated traffic applies to anyone who runs a public-facing web-based system, whether that is a traditional public website, complex web-based application, SaaS system, web portal, or web-based API. For simplicity I will use the generic term website when referring to any of these systems.

Likewise, I will use website owner to refer to the range of people who will be responsible for identifying and managing this problem—from security and platform managers to ecommerce and marketing directors.

I will use the term bot operator to identify the individual or group that is operating the automated traffic.



PART I

Background

Before going into detail about what automated traffic is doing on your website and how this can be addressed, it is important that we have a good shared understanding of what automated traffic encompasses.

The following chapters will give a brief introduction to the core elements of automated traffic and clarify some of the common misconceptions that people hold about the nature and complexity of bot traffic and bot operators.


CHAPTER 1

What Is Automated Traffic?

There is a range of different definitions of what can be classed as automated traffic.

For example, Frost & Sullivan describe bot traffic as “computer programs that are used to perform specific actions in an automated fashion,” Akamai has defined it as “automated software programs that interact with websites,” and Wikipedia defines a bot as “a software application that runs automated tasks (scripts) over the Internet,” whereas Hubspot says “A bot is a type of automated technology that’s programmed to execute certain tasks without human intervention.”

For the purposes of this book I will use the following description for automated traffic, which I feel captures the essential details of what is meant by the term and removes some of the vagaries included in the other descriptions:

Automated traffic is any set of legitimate requests made to a website by an automated process rather than triggered by a direct human action.

History of Automated Traffic

The history of the type of automated traffic I am discussing here can be traced back to 1988 with the creation of IRC bots such as the Hunt the Wumpus game and Bill Wisner's Bartender. It wasn't until 1994, however, that the first search engine spiders were created by WebCrawler (later purchased by AOL). GoogleBot followed in 1996.

Key Characteristics of Automated Traffic

For the purposes of this book, I will have a limited definition of automated traffic; this is not to say that other types of automated traffic are not a concern, just that they are addressed elsewhere.

Web-based Systems

The automated traffic discussed in this book is targeted at web-based systems and excludes other types of traffic, such as automated emails.

Layer 7

Automated traffic operates at layer 7 of the OSI Model—in other words, it operates at the application level, making HTTP/HTTPS requests to websites and receiving responses in the same format. Anything that interacts with servers via any other means is classed as outside the scope of this book.
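To make this concrete, here is a minimal sketch (the URL and user-agent string are illustrative, not from the book) of the kind of layer-7 request an automated process issues; structurally it is identical to what a browser sends:

```python
from urllib.request import Request

# A layer-7 (HTTP) request as an automated process would construct it.
# The URL and User-Agent below are invented for the example.
req = Request(
    "https://example.com/products?page=1",
    headers={"User-Agent": "Mozilla/5.0 (compatible; ExampleBot/1.0)"},
)

# Structurally this is the same application-level request a browser sends:
# a method, a URL, and headers over HTTP/HTTPS.
print(req.get_method(), req.full_url)
print(req.get_header("User-agent"))
```

Nothing at this level distinguishes the request from a human one, which is why identification has to rely on other signals discussed later in the book.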

Legitimate Requests

Automated traffic is defined as traffic that makes legitimate requests to websites (i.e., requests formulated in the same way as those made by human users). This means that the automated traffic that is identified as negative is focused on exploiting weaknesses in the business logic of systems, not exploiting security weaknesses.

Exclusions

The following types of traffic, which could be categorized as automated traffic, have been excluded from any discussion within this book. The reason for this exclusion is that they are subjects in their own right and are well catered for in other literature, with a range of well-established products and solutions in existence to mitigate the issues created.

Their exclusion from this work does not imply that they are not worthy subjects of concern for website owners. They are, in fact, very real threats that should be handled as part of any website management strategy.

DDoS (Distributed Denial of Service)

DDoS is a low-level volumetric attack, designed to overwhelm the server by the quantity of requests being made. There is a wide range of different attacks that can be made to achieve this objective, all of which aim to exploit weaknesses in networking protocols. To mitigate this, there are well-established, dedicated DDoS management tools and services that can be put in place to minimize risk from DDoS attacks.

A variation on this called application DDoS aims to make large numbers of requests for certain, known pressure points within systems, with the intention of bringing the system to its knees. This will be discussed in more depth in Chapter 4.

Security Vulnerability Exploits

These types of exploits involve attempts to make illegitimate requests to a system with the aim of exploiting weaknesses within the security of a system, allowing the operator to gain control over the server or data within the application. Common examples include SQL injection and cross-site scripting.

Hackers employ constant automated scripts that execute across the internet looking for sites/servers where these vulnerabilities have not been mitigated. Well-managed servers and good application development can protect systems from these exploits, but it is also good practice to use a web application firewall (WAF) to identify and block illegitimate requests to further minimize risk from these.
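As a small illustration of the "good application development" half of that defense, a parameterized query keeps attacker-supplied text from changing the structure of a SQL statement (the table and values here are invented for the example):

```python
import sqlite3

# Illustrative sketch: parameterized queries are one standard piece of the
# application-level defense against SQL injection mentioned above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

attacker_input = "' OR '1'='1"

# The placeholder binds the input as data, never as SQL, so the injected
# quote cannot change the query's structure.
rows = conn.execute(
    "SELECT role FROM users WHERE name = ?", (attacker_input,)
).fetchall()
print(rows)  # [] — the malicious string matches no user
```

Had the input been concatenated into the SQL string instead, the injected `OR '1'='1'` would have matched every row.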


CHAPTER 2

Misconceptions of Automated Traffic

As we've already discussed, the amount of automated traffic is growing consistently, and as it rises so too does the sophistication and complexity of the bot operators. Before discussing the activities of bot traffic in detail, it is worth addressing some of the common misconceptions that website owners may have about automated traffic.

Misconception: Bots Are Just Simple Automated Scripts

While this may have been accurate 15 years ago, the sophistication of bot traffic has been increasing massively as the technology and platforms available to bot operators improve, the defenses in place become more sophisticated, and, most importantly, the gains to be achieved increase.

Modern bots are sophisticated systems that will manage distribution of traffic across large-scale environments or large botnets and via multiple proxies in order to hide their activity among that of human users (even executing requests as part of a human session). Bots will routinely execute requests from real browsers and execute JavaScript sent to validate users as humans. Detection mechanisms such as CAPTCHA can be bypassed, either by using artificial intelligence or brute-force systems, or by employing farms of human agents to solve them on demand and pass the solution back to the bot. Bots are intelligent enough to integrate with these human services seamlessly.

Botnet

Botnets are networks of compromised computers (usually infected by viruses or other malware) that can be accessed remotely and used to execute any processes defined by the botnet operator. Often this means they are used to send requests to remote machines over the internet.

They are more commonly associated with being used for DDoS attacks but can be used for automated traffic (e.g., account takeover or card validation attempts). There is an increasing number of botnets being made available for hire.

Multiple bot activities can be coordinated into a complete system. For example, data harvesting will be undertaken to get product details from a site to identify appropriate products to target, then checkout abuse will be undertaken to create more valuable advertising subjects, and finally ad fraud will be undertaken—and all of these activities can be viewed and coordinated from a central control panel.

Similarly, ticket touts will use spinner bots to hold a ticket and then trigger another bot to automatically add this ticket to a secondary ticketing site. When the ticket is sold, the original bot will complete the purchase. A central management system is in place to see the status of tickets being held/purchased and to handle distribution of tickets to end purchasers. Additional software is used to then modify the downloaded tickets to reflect the new purchaser's details. These are just some examples of the sophistication seen in bot activity, and this level is increasing constantly to exploit weaknesses in systems, business logic, and practices and to stay ahead of the defense mechanisms that are constantly being improved.



Misconception: Bots Are Just a Security Problem

The challenge of managing automated traffic is often just dropped at the door of an information security officer (ISO) and the security department, if the company has one. For some types of automated traffic (such as credit card fraud) this makes absolute sense because it is definitively a security issue and should be handled as such. However, some other types of automated traffic (such as price aggregators) are actually business considerations and should be managed as such by a relevant section of the business.

There are a number of other roles that may be involved in making decisions about the varying types of, and challenges raised by, automated traffic. These can include roles such as Head of Platform, Head of Ecommerce, Head of Ops, and Head of Marketing.

The ideal management solution will provide sufficient information to allow people in these roles to view details of, and make informed decisions about, how to manage the elements of automated traffic specific to their roles without being dependent on a black box security-based system.

Misconception: Bot Operators Are Just Individual Hackers

Obviously, we all know that there are extremely large organizations that operate automated traffic networks (think Google), and below that there is a group of organizations that are scraping data for legitimate purposes (price aggregators, etc.), but beyond that there is sometimes a sense of a distributed set of lone hackers developing software to perpetrate scams or to sell to companies to spy on their competitors.

While there is no doubt that such individuals exist, it is far from the truth about all bot operators. The amount of money that can be made with some types of automated traffic means that they are, in reality, complex criminal organizations employing technical experts and backed by human endeavor at an organizational, strategic level and also at a lower level to complete manual tasks that are out of the scope of bot activity (e.g., completing CAPTCHAs).



There is also an increasing trend toward third-party services that are focused on delivering automated traffic activity on demand. For example, there is a range of companies who offer price/content scraping services on a per-use basis and will provide all standard bot evasion techniques as standard (and they are constantly working to improve the reliability of their evasion techniques). This means that rather than your competitors building a price scraping bot in house or by using a freelancer, they now have access to a service that is dedicated to evading bot detection in order to maintain income. Other third parties such as ticket bots, sneaker bots, and CAPTCHA farms are all being created to further increase the sophistication of automated traffic being made available to users both legitimate and dubious (as well as end consumers, as is sometimes the case with sneaker bots).

Misconception: Only the Big Boys Need to Worry About Bots

There can sometimes be a feeling that there are two types of bots:

• Generic bots that look for untargeted weaknesses across large numbers of sites

• Targeted bots that focus on specific, high-profile sites

This can lead to a false sense of security for website owners of mid-sized sites—they might feel that, as long as they have some general security protection in place, the bot operators are never going to go to the effort of targeting their site.

In reality, this is untrue: smaller sites tend to have fewer defenses, so are easier targets, and although solutions will need to be evolved to be targeted to a specific site, this is often not as much work as might be imagined. The frameworks that have been built are sophisticated enough to allow for easy expansion, and the available resources are such that a wide range of websites can be targeted.

Small and mid-sized commercial online presences have been shown to be equally targeted by automated traffic activity.



Misconception: I Have a WAF, I Don’t Need to Worry About Bot Activity

Web application firewalls (WAFs) are very useful tools that form a fundamental part of a secure system. They are similar to network firewalls, but rather than operating at a TCP/IP level, they operate at the HTTP level to process all incoming requests and match each request against a set of static rules, blocking requests that fail the checks. They are, therefore, very effective at stripping out vulnerability scanning attempts such as SQL injection attacks.

However, WAFs are not well suited for identifying bot traffic, as the challenge of spotting automated traffic is fundamentally different. Basically, WAFs scan web traffic looking for illegitimate requests designed to exploit security weaknesses in web applications, whereas bot detection systems need to scan web traffic looking for legitimate requests that are aiming to exploit weaknesses in the business logic of a web application. Typically this involves making a judgment after analyzing the series of requests made, to look for patterns of behavior that differ from those of legitimate users (either human or good bot).
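A toy sketch of the behavioral side of that distinction (the thresholds and session format here are invented for illustration, not taken from any real product): rather than judging one request against static rules, the detector judges the timing of a whole sequence.

```python
# Toy behavioral classifier (illustrative only; thresholds are invented).
# A WAF judges each request alone; bot detection judges the sequence.
MAX_REQUESTS_PER_MINUTE = 120      # faster than any plausible human
MIN_SECONDS_BETWEEN_PAGES = 0.2    # humans cannot click this fast

def classify_session(timestamps):
    """timestamps: sorted request times (in seconds) for one session."""
    if len(timestamps) < 2:
        return "human"
    duration = timestamps[-1] - timestamps[0]
    rate = len(timestamps) / max(duration / 60.0, 1.0 / 60.0)
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if rate > MAX_REQUESTS_PER_MINUTE or min(gaps) < MIN_SECONDS_BETWEEN_PAGES:
        return "suspected bot"
    return "human"

print(classify_session([0.0, 0.05, 0.10, 0.15]))  # rapid-fire requests
print(classify_session([0.0, 8.0, 21.0, 40.0]))   # human-paced browsing
```

Real systems combine many more signals (navigation order, JavaScript execution, reputation), but the shape of the judgment is the same: patterns across requests, not rules per request.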



CHAPTER 3

Impact of Automated Traffic

Before deciding on how to manage the automated traffic that is hitting your system, it is important that you have effectively assessed the impact it is having, weighed against the value it is delivering to you. When considering the impact, you need to be sure that you are not just considering the impact on your servers but also the business impact. In addition, sufficient investigation must be undertaken to determine the intent of the bot operator and to understand what they were actually trying to do when executing the automated attack.

It's important to realize that non-human traffic can deliver value while also having a negative impact on your business. In this case, you must assess the relative importance of the non-human traffic to deduce whether the benefits of this traffic outweigh the negative effects.

When assessing the impact, consider the impact on company interests, other users, system security, and infrastructure. Let's now examine each of these in turn.

Company Interests

Is the automated traffic accessing your site for purposes that would not be in the interests of your company?



Examples of this include:

• Competitors who are scraping your prices so that they can adjust their pricing accordingly, putting them at a competitive advantage

• Bots stealing your content to use on their sites, saving them the costs of creating that content or purchasing data feeds

• Spambots utilizing areas of your site that allow user-generated content (UGC), such as comments or forums, to publish offensive content or ads for services you would not want your company associated with

• Account takeover bots accessing people's personal data for use elsewhere

• Scalpers purchasing limited availability goods for resale elsewhere, creating a negative public opinion of your brand

• Creation of fake accounts in order to take unfair advantage of special offer terms

• Skewing of analytics and other metrics that would lead you to make invalid business decisions


System Security

Is this traffic trying to bypass your system defenses in order to gain access to areas of the system that should not be publicly available, such as bypassing password-protected areas of the system to gain access to users' personal/financial data or to steal credit associated with that account?

As previously discussed, there is a whole range of security exploits that can be identified by security software that will regularly be scanning your site. These are outside the scope of this book, but the impact of allowing them to hit your site without appropriate management in place can be catastrophic, including complete loss of control of servers and compromise of data.

Poor security can make your site a target for some of the other types of automated traffic attacks described in this book, such as carding or data theft. A robust approach to security management is essential to reduce the risk of reputational damage from a wide range of potential attacks.

Infrastructure

Does the non-human traffic affect your infrastructure?

System performance can be negatively impacted by automated traffic—for example, servers might reach capacity and therefore struggle to return content or process requests in an appropriate manner. Alternatively, it could affect your scalability, meaning you hit limits such as disk space required for logs, cache, or database storage, or software licence limits, much sooner than expected.

All of these can further result in a negative impact on costs. This could be due to increased bandwidth usage because of the amount of data being returned to automated processes, additional storage costs, or additional infrastructure or software licences required to run the site.

If you are scaling up your infrastructure to meet high demand from automated traffic and are not in a flexible cloud environment, then you will be paying for a level of capacity far greater than that needed to meet the business needs of the platform, just to maintain user experience during bot attacks.



In many cases the savings associated with reduced infrastructure and bandwidth costs can be sufficient to justify investing in an automated traffic management solution.

This impact is intensified by the timing of the automated traffic in relation to the peak hours seen by the business. Search engines and other legitimate automated traffic that you may rely on will usually work with you to ensure that they are not conflicting with your peak trading hours.



PART II

Types of Automated Traffic

As discussed in Part I, there are a wide variety of activities that bots may be accessing your system to carry out. In some cases, these activities might be beneficial to you (or otherwise benign), but often the intent is malicious. When looking at your traffic to identify and manage automated traffic, it is essential that you understand the intent of this traffic and how that could impact you as a website owner.

In many cases, even having identified that there is automated traffic hitting your website, it can be difficult to understand what that traffic is up to. Only when you understand the intent is it possible to put in place a management strategy to handle it. Part of understanding the intent is understanding the benefit the bot operators can achieve from the actions they are undertaking.

The chapters in this part will offer some depth of understanding of the most common types of automated traffic and provide background into the intent of that type of traffic.


Alternative Categorization

There are a number of alternative categorizations that have been applied to automated traffic. For example, OWASP's Automated Threat Handbook groups the threats into 21 different categories (these include security threats that are outside the scope of this book).

To try and further simplify things, I have categorized all types of bot traffic under nine broad headings:

• Malicious bots

• Data harvesting

• Checkout abuse

• Credit card fraud

• User-generated content (UGC) abuse

• Account takeover

• Ad fraud

• Monitoring

• Human-triggered automated traffic

Chapters 4 through 12 discuss each of these categories in more detail.


CHAPTER 4

Malicious Bots

This category includes bot activity that is designed simply to have a negative impact on a website, rather than the negative impact being a by-product of another activity designed to benefit the bot operator directly.

This bot activity has a lot of overlap with other types of security attacks, such as DDoS attacks or vulnerability exploits (as already mentioned, these are not covered at length here, as there is a wealth of literature on these topics). Attacks like this are usually orchestrated by groups wanting to hold companies to ransom in return for stopping the attacks, groups who have ideological objections to companies' activities, or, occasionally, malicious competitors.

Traditionally, the use of automated traffic as we define it here has not been the means of malicious attack, but this will change as defense against DDoS attacks and other security systems improve.


Application DDoS

The objective of a DDoS attack is to undertake a large amount of activity such that the server under attack is unable to provide the service that it is in place to provide. While this is partly done by the quantity of traffic, it is also done by forming the network requests in such a manner as to exploit weaknesses in the network protocols that make a failure more likely. As a very simple example, this could be simply opening network connections to a server and then keeping that connection open with minimal interaction until the server runs out of available connections.

Application DDoS takes a similar approach but at an application level. Rather than exploiting weaknesses in the network protocol, it looks for areas of application functionality that will struggle when the application is under load. These could be areas that involve high processor usage, integration with third-party systems, or complex database activity. Often these will be areas such as search, log in, availability checks, or real-time booking requests, but will vary with each website. The bot traffic will then just automate repeated requests to these areas of the website until the site reaches a limit and falls over or is unable to transact normally with legitimate customers.
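One common defensive response to this pattern is to rate-limit the known pressure points more aggressively than the rest of the site. A minimal sketch (the endpoint names and limits are invented for the example, not taken from the book):

```python
import time
from collections import defaultdict, deque

# Defensive sketch: per-IP rate limits on the "pressure point" endpoints
# that application DDoS targets. Names and limits below are illustrative.
EXPENSIVE_ENDPOINTS = {"/search": 10, "/login": 5}   # max requests per 60s, per IP
_history = defaultdict(deque)                        # (ip, endpoint) -> request times

def allow(ip, endpoint, now=None):
    limit = EXPENSIVE_ENDPOINTS.get(endpoint)
    if limit is None:
        return True                      # not a known pressure point
    now = time.monotonic() if now is None else now
    window = _history[(ip, endpoint)]
    while window and now - window[0] > 60:
        window.popleft()                 # forget requests outside the 60s window
    if len(window) >= limit:
        return False                     # over budget: block or challenge
    window.append(now)
    return True

# Five rapid logins from one IP are allowed; the sixth is refused.
print([allow("203.0.113.5", "/login", now=t) for t in range(6)])
```

Because these attacks rotate IP addresses, a real deployment would combine this with the session-level and behavioral signals discussed in Part III rather than relying on per-IP counting alone.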

These attacks are usually well hidden, rotating IP addresses and legitimate user agents, and are often launched via botnets.



CHAPTER 5

Data Harvesting

This category captures a range of traffic types that will access the publicly available information contained within your website and capture that data for use elsewhere. Typically this will involve accessing many pages and extracting relevant data using text pattern matching.
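As a minimal illustration of that text pattern matching (the HTML markup and field names here are invented), a harvester typically fetches a page and then extracts fields with expressions like:

```python
import re

# Illustrative only: extract product names and prices from fetched HTML
# by text pattern matching. The markup below is invented for the example.
html = """
<div class="product"><span class="name">Widget</span>
<span class="price">£9.99</span></div>
<div class="product"><span class="name">Gadget</span>
<span class="price">£24.50</span></div>
"""

pattern = re.compile(
    r'class="name">([^<]+)</span>\s*'
    r'<span class="price">([^<]+)</span>'
)

harvested = pattern.findall(html)
print(harvested)  # [('Widget', '£9.99'), ('Gadget', '£24.50')]
```

The same mechanism serves every motive in this chapter, from search indexing to price scraping; only what is done with the extracted data differs.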

Good or Bad?

Data harvesting covers the full range of motives, from good to bad. Some data harvesting bots, such as search engine spiders, are usually clearly regarded as good, with many businesses depending on these as the source of their traffic. Likewise, affiliates would typically be driving traffic to your site.

Conversely, there are data harvesting bots that are clearly bad, such as those engaged in content theft or price scraping.

There are also data harvesting bots that exist in the gray area between good and bad—for example, price and content aggregators that may be legitimately driving traffic to your site but in a way that may not be in the interests of your business.

Let's now look at some specific examples of data harvesting bots.



Search Engine Spiders

The most common form of data harvesting and the one withoutwhich the internet as we know it today wouldn’t function is thesearch engine spider The most common of which is, of course,GoogleBot, but there are many others from a range of global as well

as regional or specialist search engines These bots will usually enteryour site via the homepage or via a deep link from another site andthen follow all active links until they have accessed all pages withinthe website Each request leads to multiple other requests, hence thename spider

Search engine spiders are generally well behaved: legitimately identifying themselves in a user-agent string, following robots.txt instructions, not attempting to bypass any security mechanisms, and not using your data for anything beyond populating their search results and therefore driving traffic to your system. They can, however, sometimes be aggressive in the rate of requests that they make, and as they will often make thousands of requests, this can create short-term pressure on underlying infrastructure. Some search engines allow you to define the rate of request that is applied to your website to mitigate this impact.
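The crawl rules that a well-behaved spider honors live in robots.txt. Below is a sketch using Python's standard `urllib.robotparser`, with an invented set of rules; note that not every engine honors `Crawl-delay` (Google, for example, expects crawl rate to be configured through its own webmaster tools instead).

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content: a crawl-rate hint and a disallowed path
# for everyone, plus a full block on one named bot.
rules = """
User-agent: *
Crawl-delay: 10
Disallow: /checkout/

User-agent: Badbot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("Goodbot", "/products/widget"))  # True
print(parser.can_fetch("Goodbot", "/checkout/basket"))  # False
print(parser.can_fetch("Badbot", "/products/widget"))   # False
print(parser.crawl_delay("Goodbot"))                    # 10
```

Remember that robots.txt is purely advisory: legitimate spiders consult it, while the bad actors discussed later in this chapter simply ignore it.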

Most people welcome search engine spiders and, in fact, put changes in place to optimize their sites for the needs of search engines as a primary driver of traffic, so typically this will not be a type of traffic that you will want to take any action against. If the overhead is too high, however, some action could be taken against regional search engines. For example, if your business does not operate in China, you might take action against the Chinese search engine Baidu.

Any action taken to optimize the experience of this important source of automated traffic must be taken very carefully, as most search engines, especially Google, are on the lookout for the practice of cloaking: adjusting the experience and content returned to spiders so that it differs from that seen by real users visiting the same page, as this undermines the accuracy of their search results.

Content Theft

Content theft is the harvesting of your content for re-use elsewhere, without the consent of the website owner. Content, in this respect, can be either content you have created yourself (journalism, opinion pieces, thought leadership, etc.) or content that you have extracted from paid data feeds to display within your website (e.g., sports statistics, product information, etc.).

In some situations content theft involves bypassing a paywall, and there are several methods employed to do this.

First, by abusing the Google First Click Free policy, which said that the first three articles clicked through to your site from Google should be free and only subsequent clicks would activate the paywall. Bot traffic can generally bypass the methods put in place to enforce this process. Google has recently relaxed this policy to allow paywalled sites a wider set of options for how they can integrate with Google search results.

Second, by creation of fake accounts to take advantage of free trial periods (see “Bonus Abuse” on page 39).

And finally, by logging into a legitimate account by cracking someone’s username and password (a full discussion of account takeover is presented in Chapter 9).

After the data is harvested, these bots can engage in a number of different activities, including the following:

• Using your content within their own competitive site to provide a similar experience to their customers without the cost of creating or purchasing the data.

• Using the content within a scam version of your site that uses their own advertising in place of yours. This can include specialist browser plug-ins that will intercept requests for your site content and substitute it with the alternative site content. These sites may include advertising for goods and services that you wouldn’t want your brand associated with, so in addition to losing customers you could also experience brand damage.

• Distributing your content to wider groups of consumers than is allowed in the terms of use (typically, this is more common when there are paywalls in place on the site being poached from).

Although content theft bots are conceptually similar to spiders, they are much less well behaved. They will usually disguise themselves as human users, typically using a browser user agent, and will ignore instructions defined in robots.txt. Whereas spiders are designed to extract data from all sites, content theft is usually much more targeted, being tuned to extract specific content from target websites. Content theft bots will often employ more sophisticated methodologies, such as rotating IP addresses and varying request rates and intervals, to evade whatever protections you have put in place.

Price Scraping

A more specific form of content theft is price scraping (or odds scraping in the gambling industry). This involves the extraction from a website of specific data relating to the pricing of goods (or equivalent, such as the odds being offered for placing bets on a specific outcome). Price scraping is often undertaken by competitors to implement an effective price matching strategy; in fact, many companies make public declarations that they will always offer the cheapest price available, sometimes even displaying the prices of competitors on their product pages.

Sophisticated price matching strategies take account of the availability of products on competitor sites and adjust pricing accordingly: offering goods at a higher price when they are unavailable elsewhere and dropping the price when competitors have availability. Constant price scraping also allows for real-time reaction to competitor discounting and special offers. The more sophisticated ecommerce platforms will automate the price adjustments they make based on the data received from competitor price scraping, meaning that any discount applied by a competitor can be matched very quickly with no need for human interaction.
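As an illustration of such a strategy, here is a hypothetical repricing rule that undercuts the cheapest in-stock competitor and lifts the price toward a ceiling when no competitor has stock. Every number and rule here is invented for the example; real repricing engines layer margin floors, demand signals, and category rules on top.

```python
def match_price(our_cost: float, ceiling: float,
                competitors: list) -> float:
    """Pick a price from scraped (price, in_stock) competitor pairs."""
    in_stock = [price for price, available in competitors if available]
    if not in_stock:
        return ceiling  # scarce everywhere else: charge the ceiling
    floor = round(our_cost * 1.10, 2)        # never go below cost + 10%
    target = round(min(in_stock) - 0.01, 2)  # undercut by one cent
    return max(floor, target)

# Competitor A has stock at 24.99; competitor B is sold out at 19.99.
print(match_price(our_cost=15.00, ceiling=34.99,
                  competitors=[(24.99, True), (19.99, False)]))  # 24.98

# Everyone is sold out: the price rises to the ceiling.
print(match_price(15.00, 34.99, [(24.99, False), (19.99, False)]))  # 34.99
```

Run continuously against freshly scraped data, a rule like this is what lets a competitor's discount be matched within minutes.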

Price scrapers, to be effective, must take all actions they can to avoid detection, as the obvious defense against them would be to start responding with incorrect data, meaning your competitors are making invalid pricing decisions.

While traditionally these were usually internally developed programs aimed at specific competitors, they are increasingly being provided by specialist third parties as a service.

24 | Chapter 5: Data Harvesting


Content/Price Aggregation

This is a particular variant on content theft and price scraping, where the data being harvested is used to gather together groups of similar types of data to display in a single place for the benefit of users. An example of this would be price comparison sites, where the prices from a range of different sites for the same product are gathered and displayed to the user, allowing them to make a buying decision without needing to visit many different sites.

When run ethically, these sites will display data extracted from your site accurately and attribute the source of the data, with the objective being to drive users to your site to purchase the product or view the full content. In this sense they would argue that they are a positive benefit for you, as they will drive more users to your site. These types of service will usually not make any attempt to hide their activities, including using accurate user-agent identification, and should be willing to interact with an API, if you have one available, rather than scraping your site, or to remove your site from the list of sites they are aggregating at your request.

However, there is an evolution of these sites where, though they are aggregating information, they are less open about the source of that information. This can include content sites that correctly attribute the source of the content but display the content in full within their site, without linking through to the source of the content or displaying its advertising.

Alternatively, this could be sites that aggregate the prices for goods and then complete the purchase via your website, but with the appearance that the purchase was made via the third-party site. This may seem beneficial but can present several issues. First, it removes the possibility of providing any up-selling opportunities. Second, there is no assurance that any promises or product descriptions made ahead of the purchase are accurate. Lastly, it removes control of the sale of your goods from the strategy that you have defined as a company.

This type of site is most common in the travel industry, where certain airlines do not want to be part of any price comparison sites but those sites will scrape their content and embed it without permission. There are several examples of Ryanair successfully taking legal action against sites that have used their data in this way.


Affiliates

Abuse of affiliate systems is discussed in Chapter 10.

User Data Harvesting

Sometimes content is harvested from your site not for your own data but to cull personal data from the user-generated content areas of your site, such as reviews, comments, discussions, pictures, and so on. This data can be valuable to bad actors who are trying to build profiles of individuals, either to undertake negative actions against those particular users or to take advantage of the information contained in the online profiles for other activities, such as ad fraud, where a user with a similar profile would be more valuable.


CHAPTER 6

Checkout Abuse

Checkout abuse is when automated traffic looks to bypass or in other ways manipulate the checkout process on ecommerce sites to gain personal or business advantage. This involves automating the process of interacting with website ecommerce processes, such as adding to baskets and completing checkout. As these processes vary from site to site, the automation has to be created and implemented for each site targeted.

These are segmented from credit card abuse bots because these are activities that are not, in themselves, aimed at defrauding the website owner (that is not to say that they are not violating some legal restrictions or terms and conditions of the website).

Good or Bad?

From a business point of view, the most prominent examples of checkout abuse could be seen as positive, as they will effectively complete sales quickly and efficiently (assuming that the purchase is completed using valid and legal payment means). Large amounts of checkout abuse can result in much higher throughput of sales for high-demand items.

The negative side of checkout abuse can primarily be judged by the impact that it has on other users and therefore the brand damage that can be associated with this. There is often customer backlash, which is then picked up by the media, when items are sold out in very short periods of time to non-human traffic, especially as those items are often made available for sale via re-sale sites at vastly marked-up prices.

Some of the more complex variants of this, such as inventory grabbers and spinners, have a much clearer negative impact on business interests.

We’ll now look at some specific types of checkout abuse.

Scalpers

Scalpers are automated processes that will capture goods and complete the checkout process in a fully automated fashion. The most common example of this is in event ticketing (where high-demand tickets are purchased for re-sale at inflated prices), but it is also becoming increasingly common in the fashion industry as labels release limited edition items that are of interest to collectors.

Scalpers take advantage of the fact that they can complete processes much more quickly than human users, so have a much higher chance of being successful during busy periods. They will also usually launch many separate attempts to complete transactions.

These bots are traditionally difficult to identify, as they will disguise themselves as human users, using human user-agent identification and following the same process as human users. They are also only making a relatively small number of requests, unlike scrapers that are making hundreds or thousands of different requests. These kinds of low-volume attacks are easy to hide within the general high throughput of a busy on-sale.
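One heuristic defenders use against this speed advantage is timing: humans need time to read a page, choose seats, and type card details, so a session that beats a plausible human minimum on every checkout step is worth challenging. The step names and thresholds below are invented for illustration.

```python
# Minimum plausible seconds a human needs on each checkout step
# (invented values for the sketch).
HUMAN_FLOOR = {"select_seats": 3.0, "enter_details": 10.0, "payment": 5.0}

def looks_automated(step_durations: dict) -> bool:
    """Flag a session whose every step beat the human floor."""
    return all(
        step_durations.get(step, 0.0) < floor
        for step, floor in HUMAN_FLOOR.items()
    )

bot_session = {"select_seats": 0.4, "enter_details": 0.9, "payment": 0.7}
human_session = {"select_seats": 12.0, "enter_details": 45.0, "payment": 9.0}
print(looks_automated(bot_session))    # True
print(looks_automated(human_session))  # False
```

On its own a timing check is easily gamed (a bot can simply add delays), so it is usually one signal among many rather than a decisive test.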

Big Business

The secondary ticketing market is estimated at $15 billion in the United States alone, with some estimates saying that 70% of tickets sold via these platforms are sold by touts or professional traders. A separate UK government report estimated that only 20%–40% of tickets for a major concert would be sold directly to the general public.

Sophisticated scalping operations include not only the automation of the purchase process (including the bypass of protections such as CAPTCHA) but also the tracking of ticket purchases, automatic posting to re-sale sites, and the manipulation of the PDF tickets afterward to reflect the details of the new purchaser.

There has also been a recent growth in online services that allow individuals to pay for attempts to purchase specific items from a limited release sale via the use of automated processes.

Ticket scalpers are more of a publicity and brand protection concern than a profitability concern, with users getting unhappy because they miss out on ticket purchases, especially when the same tickets are available for re-sale at a marked-up price. Such situations lead to bad publicity, and often the artist whose tickets are involved will respond to the situation, including on some occasions cancelling individual tickets seen as available for re-sale or even completely cancelling shows. This in turn leads to demands for political solutions and regulation to be put in place in several countries around the world.

For this reason there are a lot of innovations in the ticketing industry that aim to make the transfer of tickets after purchase much more difficult. It is yet to be seen whether these methods (or legal changes) will manage to bring this situation under control. These issues and responses are becoming more common in other industries.

Spinners

Spinners are an evolution of scalping bots, but a more insidious version. Rather than completing the purchase process, they hold the goods in a basket or equivalent, knowing that the product will remain assigned to them until the transaction is completed. While holding it in this state, they advertise the product for re-sale on another site, and only if the product is re-sold do they complete the initial transaction.

The most common use of spinners is for ticket purchases. This means that ticketing sites report shows as sold out before all the tickets have actually been sold, and that the touts do not have to take the risk of paying for tickets up front until they know they have a buyer lined up.
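The standard mitigation for basket-holding bots of this kind is to make reservations expire: stock held in a basket returns to the pool after a short TTL unless the purchase completes. A minimal sketch follows, with timestamps passed in explicitly to keep the example deterministic (a real system would read the clock directly).

```python
class Inventory:
    """Stock pool where basket holds lapse after a fixed TTL."""

    def __init__(self, stock: int, hold_seconds: float):
        self.stock = stock
        self.hold_seconds = hold_seconds
        self.holds = {}  # session id -> time the hold was taken

    def _expire(self, now: float) -> None:
        for session, taken in list(self.holds.items()):
            if now - taken > self.hold_seconds:
                del self.holds[session]  # hold lapsed: stock returns

    def reserve(self, session: str, now: float) -> bool:
        self._expire(now)
        if len(self.holds) >= self.stock:
            return False  # everything currently held
        self.holds[session] = now
        return True

tickets = Inventory(stock=1, hold_seconds=300)
print(tickets.reserve("spinner-bot", now=0))  # True: bot grabs the ticket
print(tickets.reserve("real-fan", now=60))    # False: still held by the bot
print(tickets.reserve("real-fan", now=400))   # True: the hold has expired
```

The TTL forces a spinner to keep re-reserving (a detectable pattern) and caps how long exhausted inventory stays off sale, though it cannot stop a determined bot on its own.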


Spinner bots will be specifically created for target systems to take advantage of weaknesses in the business logic, which has been designed to prevent users from losing out on tickets.

Inventory Exhaustion

Inventory exhaustion (also called inventory grabbing) is an even less palatable version of a spinner bot. Like a spinner bot it aims to capture inventory within baskets and hold it there, but unlike a spinner bot, it does so with no intention of completing a purchase.

In this case the intention is simply to grab limited-availability goods and remove them from availability for anyone else, eventually leading websites to report those products as out of stock. This will typically be implemented by competitors who have identical products available, usually at a higher price. Rather than price scraping and ensuring that they are selling at a lower price than their competitor, they force their competitor to report items as out of stock and carry on selling at a higher price. This is often combined with other activities such as affiliate fraud (discussed in Chapter 10).

This behavior is clearly not in the interest of either the website owner or the consumer.

Snipers

Snipers are automated processes that monitor time-based online processes and submit information at the very last moment, removing the opportunity for other people to respond to that action. The most common example is last-second bidding on an auction item. While it is true that a human user could manually carry out this same action, automated processes can usually complete it closer to the deadline, beating a human competitor.

From a site owner’s perspective, this activity creates two issues:

• It reduces the level that an auction could possibly have reached if bidding had been carried out without sniping.

• It is usually seen as unfair competition by human users of the site, who will struggle to win an auction.
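One common countermeasure to sniping is a "soft close": any bid arriving in the final moments pushes the deadline back, so a last-second bid can always be answered. The window and extension lengths below are invented for illustration.

```python
SNIPE_WINDOW = 60   # seconds before close in which a bid triggers...
EXTENSION = 120     # ...this many extra seconds of bidding

def deadline_after_bid(deadline: float, bid_time: float) -> float:
    """Return the (possibly extended) auction deadline after a bid."""
    if deadline - bid_time <= SNIPE_WINDOW:
        return bid_time + EXTENSION  # bid near the close: extend the auction
    return deadline

close = 1000.0
print(deadline_after_bid(close, 500.0))  # 1000.0: early bid, no change
print(deadline_after_bid(close, 990.0))  # 1110.0: sniped, close pushed back
```

Under this rule a sniper's bid no longer ends the auction; it simply restarts the clock, removing the advantage of bidding at the last possible moment.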


Posted: 12/11/2019, 22:32
