
INFORMATION EXTRACTION FROM DYNAMIC WEB SOURCES

ROSHNI MOHAPATRA

NATIONAL UNIVERSITY OF SINGAPORE

2004


INFORMATION EXTRACTION FROM DYNAMIC WEB SOURCES

ROSHNI MOHAPATRA

(B.E.(Computer Science and Engineering), VTU, India)

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE

DEPARTMENT OF COMPUTER SCIENCE

NATIONAL UNIVERSITY OF SINGAPORE

2004


Acknowledgements

First of all, I would like to express my sincere thanks and appreciation to my supervisor, Dr Kanagasabai Rajaraman, for his attention, guidance, insight, and support, which have led to this dissertation. Through his ideas, he is in many ways responsible for much of the direction this work took.

I would also like to thank Prof Sung Sam Yuan and Prof Vladimir Bajic, who have been a source of inspiration to me. I am grateful to Prof Kwanghui Lim, Department of Business Policy, NUS School of Business, for being a mentor and friend, and for listening to my frequent ramblings.

I would like to acknowledge the support of my thesis examiners, A/P Tan Chew Lim and Dr Su Jian. I greatly appreciate the comments and suggestions given by them.

Ma and Papa continue to pull off the feat of helping me with my work without caring to know the least about it. I would like to thank them and the rest of the family for their love, support and encouragement. Special thanks to Arun for his patience, support, favors and all the valuable input for this thesis and otherwise. Finally, a big thank you to all my friends, wherever they are, for all the good times we have shared that have helped me come this far.


Contents

1.1 Background
1.2 Information Extraction from the Web
1.2.1 Wrappers
1.2.2 Wrapper Generation
1.3 Organization

2 Survey of Related Work
2.1 Wrapper Verification Algorithms
2.1.1 RAPTURE
2.1.2 Chidlovskii's Algorithm
2.2 Wrapper Reinduction Algorithms
2.2.1 ROADRUNNER
2.2.2 DataProg
2.2.3 SG-WRAM
2.3 Summary

3 ReInduce: Our Wrapper Reinduction Algorithm
3.1 Motivation
3.2 Formalism
3.3 Generic Wrapper Reinduction Algorithm
3.3.1 Our Approach
3.3.2 Algorithm ReInduce_W
3.4 Incremental Wrapper Induction
3.4.1 LR Wrapper Class
3.4.2 LR Wrapper Class
3.4.3 LR_RE Wrapper Class
3.5 Summary

4 Experiments
4.1 Performance of Induce_LR
4.1.1 Sample cost
4.1.2 Induction cost
4.2 Performance of ReInduce_LR


List of Tables

3.1 Algorithm ReInduce_W
3.2 Algorithm Induce_LR
3.3 Trace of Algorithm Induce_LR for Page P_A
3.4 Algorithm Induce_LR
3.5 Expressiveness of LR_RE
3.6 Algorithm Induce_LR_RE
3.7 Algorithm Extract_LR_RE
4.1 Websites considered for evaluation of Induce_LR
4.2 Details of the webpages
4.3 Precision and Recall of Induce_LR
4.4 Time complexity of Induce_LR
4.5 White Pages Websites considered for evaluation of ReInduce_LR
4.6 Performance of ReInduce_LR
4.7 Average Precision and Recall for Existing Approaches


List of Figures

1.1 Froogle: A product search agent
1.2 Weather listing from the Channel NewsAsia website
1.3 Page P_A and HTML Source
1.4 Page P_B and HTML Source
2.1 Life Cycle of a Wrapper
2.2 Layout changes in an Online Address Book
2.3 Content changes in a Home supplies page
2.4 Changed Address Book Address_m
2.5 User defined schema for the Address Book Example
2.6 Content Features of the Address field
3.1 Incremental Content changes in the Channel NewsAsia Website
3.2 ReInduce: Wrapper Reinduction System
3.3 Page P_A and its corresponding LR Wrapper
3.4 Illustration of LR Constraints
3.5 HTML Source for Modified Page P_A^m
3.6 Page P_A and corresponding LR and LR_RE wrappers
3.7 HTML Source for Modified Page P_A'
A.1 Screenshot from Amazon.com
A.2 Screenshot from Google.com
A.3 Screenshot from uspto.gov
A.4 Screenshot from Yahoo People Search
A.5 Screenshot from ZDNet.com


Abstract

To organize, analyze and integrate information from the Internet, many existing systems need to automatically extract the content from webpages. Most systems use customized wrapper procedures to perform this task of information extraction. Traditionally, wrappers are coded manually, but hand-coding is a tedious process. A technique known as wrapper induction has been proposed for automatically learning a wrapper from a given resource's example pages. In both these methods, the key problem is that, due to the dynamic nature of the web, the layout of a website may change over time and hence the wrapper may become incorrect. The problem of reconstructing a wrapper, to ensure continuous extraction of information from dynamic web sources, is called wrapper reinduction. In this thesis, we investigate the wrapper reinduction problem and develop a novel algorithm that can detect layout changes and reinduce wrappers automatically.

We formulate wrapper reinduction as an incremental learning problem and identify wrapper induction from an incomplete label as a key problem to be solved. We observe that the page content usually changes only incrementally over small time intervals, even though the layout may change drastically and none of the syntactic features may be retained. We thus propose a novel algorithm for incrementally inducing a class of wrappers called LR wrappers, and show that this algorithm asymptotically identifies the correct wrapper as the number of tuples is increased. This property is used to propose an LR wrapper reinduction algorithm. We demonstrate that this algorithm requires examples to be provided exactly once; thereafter, the algorithm can detect layout changes and reinduce wrappers automatically, so long as the wrapper changes are in LR.

We have performed experimental studies of our reinduction algorithm using real web pages and observed that the algorithm is able to achieve near-perfect performance. In comparison, DataProg has reported 90% precision and 80% recall, and SG-WRAM 89.5% precision and 90.5% recall. However, DataProg and SG-WRAM assume that the content to be extracted follows specific patterns, which is not required by our algorithm. Furthermore, our algorithm has been observed to be efficient and capable of learning from a small number of examples.


1.1 Background

Information Integration systems deal with the extraction and integration of data from various sources [4]. An application developer starts with a set of web sources and creates a unified view of these sources. Once this process is complete, an end user can issue database-like queries as if the information were stored in a single large database [30]. Many such approaches have been discussed in [4, 9, 13, 25].


Figure 1.1: Froogle: A product search agent

Intelligent agents, or software agents, is a term used to describe fully autonomous systems that perform the process of managing, collating, filtering and redistributing information from many resources [36, 8]. Broadly put, agents would include information integration as one task, but they additionally analyze the information obtained from various sources. These systems assist users by finding information or performing simpler tasks on their behalf. For instance, such a system might assist in product search to aid online shopping; Froogle, shown in Figure 1.1, is one such agent. Some agents help in web browsing by retrieving documents similar to already-requested documents [47] and presenting cross-referenced scientific papers [40]. More commercial uses have been proposed: Comparative Shopping Agents [19], Virtual Travel Assistants [2, 22], and Mobile Agents [10]. Many such agents are deployed and listed online and provide a wide range of functionality. A comprehensive survey of such agents has been done in [47].


A class of these agents helps tackle the information overload on the internet by assisting us in finding important resources on the Web [11, 37, 26, 54], and also tracks and analyzes their usage patterns. This process of discovery and analysis of information on the World Wide Web is called Web Mining. Web mining is a huge, interdisciplinary and very dynamic scientific area, converging from several research communities such as database, information retrieval, and artificial intelligence, especially from machine learning and natural language processing. It includes the automatic search and analysis of information resources available online (Web Content Mining), the discovery of the link structure of the hyperlinks at the document level (Web Structure Mining), and the analysis of user access patterns (Web Usage Mining) [17]. A taxonomy of Web Mining tools has been described in [17]. A detailed survey on Web Mining has been presented in [31].

Originally envisioned by the World Wide Web Consortium (W3C) to evolve, proliferate, and be used directly by people, the initial web architecture included a simple HTTP, URI, and HTML source structure. People used query mechanisms (e.g., HTML forms) and received output in the form of HTML pages. This was very well suited for manual interaction. As expected, the existence of an open and freely usable standard allows anyone in the world to experiment with extensions; the HTTP and HTML specifications have both grown rapidly in this environment. HTML pages now contain extensive, detailed formatting information which is specific to one type of browser, and many vendor-specific tags have been added, making HTML useful only as a display language without any standard structure.

Now efforts are underway to standardize and incorporate more structure into the web. The advent of XML has helped tackle this lack of structure, but it is not as commonly used and there is very limited native browser support. Though many websites employ XML in the background, there are still many HTML-based websites that need to be converted to XML before universal adoption. Thus, there is still a need to convert the existing HTML data to XML, and technology does not provide a trivial solution for this [39].

The Web, which is characterized by diverse authoring styles and content variations, does not have a rigid and static structure like relational databases. Most pages are composed of natural language text, and are neatly 'formatted' with a title, subtitles, paragraphs, etc., more or less like traditional text documents. But we observe that the web demonstrates a fair degree of structure in data representation [20], is highly regular in order to be human-readable, and is often described as semi-structured. Normally, a web document may contain its own metadata, but the common case is for the logical structure to be implicitly defined by a combination of physical structure (e.g., HTML tags for a page, line and paragraph boundaries for a free-text resource) and content indicators (e.g., words in section headings, capitalization of important words, etc.) [55].

For example, a webpage listing the world weather report may list the results in the form of a tuple (city, condition, max temperature, min temperature), as shown in Figure 1.2. Many such tuples may be present on the same page, appropriately formatted, giving it the appearance of a relational database. Similarly, a movie listing may have the information in the order (movie, rating, theater, time).

While unstructured text may be difficult to analyze, semi-structured text poses a different set of challenges. It is interlaced with extraneous elements like advertisements and HTML formatting constructs [33], and hence extraction of data from the Web is a non-trivial problem.

Figure 1.2: Weather listing from the Channel NewsAsia website

The primary problem faced by Information Integration systems and intelligent agents is not resource discovery, since most of them look at a few trusted sources related to specific domains. Since semi-structured pages contain a lot of extraneous information, the problem lies in being able to extract the contents of a page. Kushmerick et al. [36] advocate this task of information extraction from the web as the core enabling technology for a variety of information agents.

1.2 Information Extraction from the Web

At the highest level, this thesis is concerned with Information Extraction from the Web. Information Extraction (IE) is the process of identifying the particular fragments of an information resource that constitute its core semantic content [34]. A number of IE systems have been proposed for dealing with free text (see [56, 12] for example) and semi-structured text [6, 18, 33, 55].


Information extraction algorithms can be further classified on the basis of whether they deal with semi-structured text or semi-structured data [39]. Note that in the former the data can only be inferred, while in the latter the data is implicitly formatted. The focus of this thesis is on semi-structured data extraction. A taxonomy for these Web data extraction methods has been described in the detailed survey by Laender et al. [39]. In this survey, the existing methods are classified into natural language processing (NLP), HTML structure analysis, machine learning, data modeling and ontology-based methods.

To extract information from semi-structured information resources, information extraction systems usually rely on extraction rules tailored to that source, generally called Wrappers. Wrappers are software modules that help capture the semi-structured data on the web into a structured format. They have three main functions [32], sketched in code after the list:

• Download: They must be able to download HTML pages from a web site.

• Search: Within a resource, they must be able to search for, recognize and extract specified data.

• Save: They should save this data in a suitably structured format to enable further manipulation. The data can then be imported into other applications for additional processing.

According to [5], 80% of the published information on the WWW is based on databases running in the background. When compiling this data into HTML documents, the structure of the underlying databases is completely lost. Wrappers try to reverse this process by restoring the information to a structured format.

Figure 1.3: Page P_A and HTML Source.

Also, it can be observed that across different web sites and web pages in HTML, the structural formatting (HTML tags or surrounding text) may differ, but the presentation remains fairly regular. Wrappers also help in coping with the structural heterogeneity inherent in many different sources. By using several wrappers to extract data from the various information sources of the WWW, the retrieved data can be made available in an appropriately structured format [32].

To be able to search data in semi-structured web pages, wrappers rely on key patterns that help recognize the important information fragments within a page. The most challenging aspect of Web data extraction by wrappers is to be able to recognize the data among uninteresting pieces of text.

For example, consider an imaginary web site containing Person Name and Country Name entities, shown in Figure 1.3. To extract the two entities, we can propose a wrapper, say PCWrapper, using the delimiters {<B>, </B>, <I>, </I>}, where the first two define the left and right delimiters of the Person Name and the last two define the corresponding delimiters for the Country Name. This wrapper can be used to extract the contents of the page P_A, and of any other page where the same features are present.
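As a hedged sketch of how such a delimiter-based wrapper behaves, the snippet below applies the PCWrapper delimiters to a stand-in for page P_A (the original figure is not reproduced here; the names follow the label L_A used in Section 3.2):

import re

PAGE_A = """<HTML><TITLE>Countries</TITLE><BODY>
<B>Jack</B>,<I>China</I><BR>
<B>John</B>,<I>USA</I><BR>
<B>Joseph</B>,<I>UK</I><BR>
</BODY></HTML>"""

def pc_wrapper(html):
    # Each tuple is the text between <B>..</B> (Person Name) followed by
    # the text between <I>..</I> (Country Name).
    return re.findall(r"<B>(.*?)</B>.*?<I>(.*?)</I>", html)

print(pc_wrapper(PAGE_A))
# [('Jack', 'China'), ('John', 'USA'), ('Joseph', 'UK')]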

One approach to creating a wrapper would be to hand-code it [24], but this is a tedious process. Techniques have been proposed for constructing wrappers semi-automatically or automatically, using a resource's sample pages. The automatic approaches that use supervised learning need the user to provide some labeled pages indicating the examples. Many such approaches were proposed in RAPIER [12], WHISK [56], WIEN [33], SoftMealy [28], STALKER [50] and DEByE [38]. A method for automatic generation of wrappers with unsupervised learning was introduced in RoadRunner [18].

To extract the data, wrappers use either content-based features or landmark-based features. Content-based approaches [12, 56] use content/linguistic features like capitalization, presence of numeric characters, etc., and are suitable for Web pages written in free text, possibly using a telegraphic style, as in job listings or rental advertisements. Landmark-based approaches [33, 28, 50] use delimiter-based extraction rules that rely on formatting features to delineate the structure of the data found [39], and hence are more suitable for data formatted in HTML.

For example, in Figure 1.3, the wrapper PCWrapper can be learnt automatically from examples of (Person Name, Country Name) tuples.

Since the extraction patterns generated in all these systems are based on content or delimiters that characterize the text, they are sensitive to changes in the Web page format.


Figure 1.4: Page P_B and HTML Source.

In this sense, they are source-dependent: they either need to be reworked or rerun to discover new patterns for new or changed source pages. For example, suppose the site in Figure 1.3 changes to the new layout in Figure 1.4.

Note that PCWrapper no longer extracts correctly. It will extract the tuples as (China, James), (India, John), (USA, Jonathan) rather than (Jack, China), (James, India), (John, USA), (Jonathan, UK).

Kushmerick [16, 18] investigated 27 actual sites over a period of 6 months, and found that 44% of the sites changed their layout at least once during that period [35]. If the source modifies its formatting (for example, to "revamp" its user interface), the observed content or landmark feature will no longer hold and the wrapper will fail [36]. In such cases, the extraction of data from such web pages becomes difficult and is clearly a non-trivial problem.

In this thesis, we focus on this problem of Extraction of Information from Dynamic Web Sites. We deal with dynamic web pages: typically, a web page which is modified in its layout, content or both. The challenge here is to generate the wrapper automatically when page changes occur, such that data continues to be extracted for the user. In this thesis, we develop systems that are capable of extracting the content of such dynamic webpages. We propose a novel approach for dealing with dynamic websites and present efficient algorithms that can perform continuous extraction of information.

1.3 Organization

The rest of the thesis is organized as follows:

Chapter 2 is dedicated to reviewing the existing literature on information extraction from dynamic websites and evaluating the strengths and weaknesses of existing methods. We summarize the key learnings from these methods and present the scope of our work.

Chapter 3 presents a detailed description and analysis of our approach. We formally define the problem of information extraction from dynamic websites and our approach to tackling it. We discuss the formal framework for our algorithm, and define and analyze in detail the wrapper classes. We also present a study and analysis of algorithms to learn these wrappers, and use them to propose a novel method to learn new wrappers on the fly when layout and content changes occur in the website.

Chapter 4 discusses the empirical evaluation of our work through experiments on real webpages. We study the sample and time complexity of our algorithms and compare the results to existing approaches.

Chapter 5 summarizes our work and indicates its merits as well as limitations. We propose ways to extend the algorithms to achieve better performance and also pose open problems for further investigation.


Chapter 2

Survey of Related Work

As discussed in the previous chapter, wrappers are software modules that help us capture semi-structured data in a structured format. We noted that these wrappers are susceptible to "breaking" when website layout changes happen. To rectify this problem, a new wrapper needs to be induced using examples from the modified page. This is called the Wrapper Maintenance problem, and it consists of two steps [36, 42]:

1. Wrapper Verification: To determine whether a wrapper is correct.

2. Wrapper Reinduction: To learn a new wrapper if the current wrapper has become incorrect.

The entire process of Wrapper Induction, Verification and Reinduction is illustrated in Figure 2.1 [42].


Figure 2.1: Life Cycle of a Wrapper (diagram showing labeled pages feeding Wrapper Induction, pages flowing through the wrapper to extracted data, Wrapper Verification detecting changes, and automatic relabeling feeding the Reinduction System)

The wrapper induction system takes a set of web pages labeled with examples of the data to be extracted. The output of the wrapper induction system is a wrapper, consisting of a set of rules to identify the data on the page.

A wrapper verification system monitors the validity of the data returned by the wrapper. If the site changes, the wrapper may extract nothing at all, or data that is not correct. The verification system will detect the data inconsistency and notify the operator, or automatically launch a wrapper repair process. A wrapper reinduction system repairs the extraction rules so that the new wrapper works on the changed page.

We take a simple example to illustrate this. Consider the example given in Figure 2.2. The wrapper Address_wrap for page Address_o is the same as PCWrapper in the previous chapter: {<B>, </B>, <I>, </I>}. When the page Address_o changes its layout to Address_c, wrapper Address_wrap would extract (12 Orchard Road, James), (34 Siglap Road, June), (22 Science Drive, Jenny) on page Address_c.


<HTML>
<TITLE>Address Book</TITLE><BODY>
<B>Jack</B>,<I>1234 Orchard Road</I><BR>
<B>James</B>,<I>3454 Siglap Road</I><BR>
<B>John</B>,<I>22 Alexandra Road</I><BR>
<B>Jonathan</B>,<I>1156 Kent Ridge</I><BR>
</BODY></HTML>

(a) Original Address Book Address_o

<HTML>
<TITLE>Address Book</TITLE><BODY>
<I>Jack</I>,<B>12 Orchard Road</B><BR>
<I>James</I>,<B>34 Siglap Road</B><BR>
<I>June</I>,<B>22 Science Drive</B><BR>
<I>Jenny</I>,<B>11 Sunset Blvd</B><BR>
</BODY></HTML>

(b) Changed Address Book Address_c

Figure 2.2: Layout changes in an Online Address Book

The wrapper verification system will identify that the extracted data is incorrect. The wrapper reinduction system will help learn the new wrapper: {<I>, </I>, <B>, </B>}.
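To make the reinduction step concrete, here is a simplified sketch on the address-book example: values remembered from Address_o are located on the changed page and the new delimiters are read off around them. The helper is hypothetical; the actual induction algorithms of Chapter 3 compute common prefixes and suffixes over all occurrences rather than fixed-width slices.

ADDRESS_C = """<HTML><TITLE>Address Book</TITLE><BODY>
<I>Jack</I>,<B>12 Orchard Road</B><BR>
<I>James</I>,<B>34 Siglap Road</B><BR>
</BODY></HTML>"""

def delimiters_around(html, value):
    # Crude stand-in for the common prefix/suffix search: grab the enclosing
    # tag on each side (3 chars left, 4 right, since close tags carry a '/').
    i = html.find(value)
    return html[i - 3:i], html[i + len(value):i + len(value) + 4]

# The names 'Jack' and 'James' survived the layout change, so they anchor
# the new Name delimiters; agreement across examples yields {<I>, </I>}.
print(delimiters_around(ADDRESS_C, "Jack"))   # ('<I>', '</I>')
print(delimiters_around(ADDRESS_C, "James"))  # ('<I>', '</I>')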

Wrapper Maintenance has been investigated in the literature. Below, we review the important works and discuss the strengths and limitations of these methods.

2.1 Wrapper Verification Algorithms

Wrapper verification is the step that determines whether a wrapper is still operating correctly. When a Web site changes its layout, or in the case of missing attributes, the wrapper will either yield NULL results or wrong results. In such a case, the wrapper is considered to be broken. This can become a big bottleneck for information integration tools and also for information agents.

2.1.1 RAPTURE

Kushmerick [35] proposed RAPTURE, a method for wrapper verification using a statistical approach. It uses heuristics like word count and mean word length. The method relies on obtaining these heuristics for the new page and comparing them against the heuristic data for pre-verified pages to check whether the page is correct. An outline of the steps is given below:

• Step 1: Estimate the tuple distribution parameters for the pre-verified pages. The number of tuples is assumed to follow a normal distribution; the mean tuple number and the standard deviation are computed.

• Step 2: Estimate the feature value distribution parameters for each attribute in the pre-verified pages. For this, simple statistical features like word count and word length are used. For example, the word count for 'Jonathan' is 1 and for '1156 Kent Ridge' it is 3, and the mean word length for the name field is 5.25.

• Step 3: For any new page, a similar computation of the tuple distribution and feature value distributions is done. These values are compared against the values for the pre-verified pages. For example, for feature 1 (Name), from our computation on Address_o we know that the average word count is 1, but on Address_c it is computed to be 3.

• Step 4: Based on Step 3, the overall verification probability is computed. This probability is compared against a fixed threshold to determine whether the wrapper is correct or incorrect. In the case of our example, it would return CHANGED.


Item        List Price   Our Price
Chopsticks  $6.95        $4.95
Spoons      $25.00       $10.00

(a) Original content, Home_o

Item        List Price   Our Price
Chopsticks  $6.95        $3.95
Spoons      $25.00       $5.00

(b) Modified content, Home_c

Figure 2.3: Content changes in a Home supplies page

Strengths: For the most part, this method uses a black-box approach, measuring overall page metrics, and hence it can be applied in any wrapper generation system for verification. RAPTURE uses very simple numeric features to compute a probabilistic similarity measure between a wrapper's expected and observed output. After conducting experiments with numerous actual Internet sources, the authors claim RAPTURE performs substantially better than standard regression testing approaches.

Weaknesses: Since the information in even a single field can vary considerably, overall statistical distribution measures may not be sufficient. For example, in listings of scientific publications, the author names and the publication names may all vary too drastically, leading to ambiguity during verification. Such cases, though rare, make this approach ineffective unless more features are used during verification, like digit density, upper-case density, letter density, HTML density, etc. For example, if the contents of the page in Figure 2.3(a) change to those in Figure 2.3(b) apart from the layout, then based on content patterns it would be very difficult to distinguish the 'List Price' from 'Our Price'. Additionally, this method does not examine reinduction at all.


2.1.2 Chidlovskii's Algorithm

<HTML>
<TITLE>Address Book</TITLE><BODY>
<B>Jack</B>,<I>1234 Orchard Road</I><BR>
<B>James</B>,<I>3454 Siglap Road</I><BR>
<B>Jenny</B>,<I>22 Alexandra Road </I> <BR>
<B>Jules</B>,<I>1 Kent Ridge</I><BR>
</BODY></HTML>

Figure 2.4: Changed Address Book Address_m

Chidlovskii's algorithm assumes that a page change is a local change or concept shift. The automatic maintenance system repairs wrappers under this assumption of "small change".

This method tackles verification using classifiers built from content features of the extracted information; for feature 1 (Name): average length = 5.25, number of upper-case characters = 1, number of digits = 0, etc. The approach extends the conventional forward wrappers with backward wrappers, creating a multi-pass wrapper verification approach. In contrast to forward wrappers, backward wrappers scan files from the end to the beginning. The backward wrapper is similar in structure to the forward wrapper, and can run into errors when the format changes. However, because of the backward scanning, it will fail at positions different from where the forward wrapper would fail. This typically works in case of errors generated by typos or missing close tags in HTML pages, and helps to fine-tune the answers further.

If page Address_o is changed to page Address_m, as in Figure 2.4, the forward wrapper Address_wrap_f would extract (Jack, 1234 Orchard Road), (James, 3454 Siglap Road), (Jenny, 22 Alexandra Road<BR><B>Jules</B>,<I>1 Kent Ridge) on page Address_m.

The backward wrapper, scanning page Address_m from the end, would extract the tuples (Jules, 1 Kent Ridge), (James, 3454 Siglap Road), (Jack, 1234 Orchard Road) on page Address_m.
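The two scans can be sketched as follows; the page string models the broken entry (the missing </I> implied by the tuples above), and the regular expressions are illustrative stand-ins for the actual wrapper machinery.

import re

ADDRESS_M = ("<B>Jack</B>,<I>1234 Orchard Road</I><BR>"
             "<B>James</B>,<I>3454 Siglap Road</I><BR>"
             "<B>Jenny</B>,<I>22 Alexandra Road <BR>"   # close tag missing
             "<B>Jules</B>,<I>1 Kent Ridge</I><BR>")

def forward(html):
    # Forward scan: each <B>..</B> then the next <I>..</I>.
    return re.findall(r"<B>(.*?)</B>,<I>(.*?)</I>", html)

def backward(html):
    # Backward scan: reverse the page and the delimiters, then undo.
    pairs = re.findall(r">I/<(.*?)>I<,>B/<(.*?)>B<", html[::-1])
    return [(name[::-1], addr[::-1]) for addr, name in pairs]

print(forward(ADDRESS_M))   # third tuple swallows the Jules entry
print(backward(ADDRESS_M))  # fails elsewhere: recovers Jules, loses Jenny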

Strengths: The forward-backward scanning is unique and seems to be a robust approach to wrapper verification, especially for missing attributes and tags. Tested on a database of 18 websites, including the scientific literature database DBLP, this method reports an average error of only 4.7% when using the forward-backward wrappers with the context classifier.

Limitations: Though the forward-backward wrapper approach has an advantage over other methods in verification when there are missing tags, the use of content features may not be very effective in many cases. Since the information in even a single field can vary considerably, overall statistical distribution measures may not be sufficient.

2.2 Wrapper Reinduction Algorithms

Wrapper reinduction is the process of learning a new wrapper when the current wrapper is broken. Wrapper reinduction is a tougher problem than wrapper verification: not only does the wrapper have to be verified, a new wrapper must be constructed as well. It requires new examples to be provided for learning, which may be expensive when many sites are being wrapped. Conventional wrapper induction models cannot be directly used for reinduction, since many of them require detailed manual labeling for training, which can become a bottleneck for reinduction. So the wrapper reinduction task usually deals with locating training examples on the new page, automatically labeling them, and supplying them to the wrapper induction module to learn the new wrapper.

2.2.1 ROADRUNNER

ROADRUNNER [18] is a method that uses unsupervised learning to learn wrappers. Pages from the same website are supplied, and a page comparison algorithm is used to generate wrappers based on similarities and mismatches.

The algorithm performs a detailed analysis of the HTML tag structure of the pages to generate a wrapper that minimizes mismatches. The system employs wrappers based on a class of regular expressions, called Union-Free Regular Expressions (UFREs), which are very expressive. The extraction process compares the tag structure between the sample pages and generates regular expressions that handle the structural mismatches found between them. In this way, the algorithm discovers structures such as tuples, lists and variations [39].

An approach similar to ROADRUNNER was used by Arasu et al. [3]. They propose automatic data extraction by inducing the underlying template of sample pages with the same structure from data-intensive web sites, deducing templates from a set of template-generated pages and extracting the values encoded in them. However, this approach does not handle multiple values listed on one page.


Strengths: Since this method needs no examples to learn the wrappers, it has an obvious strength: it provides an alternative way to deal with the wrapper maintenance problem, especially in cases where no examples are available.

Limitations: Since ROADRUNNER searches a larger wrapper space, the algorithm is potentially inefficient. The unsupervised learning method also gives little control to the user. The user might want to make refinements and extract only a specific subset of the available tuples; in such cases, some amount of user input is clearly necessary to extract the correct set of tuples. Another problem with this approach is the need for many examples to learn the wrapper accurately [45].

2.2.2 DataProg

Knoblock et al. [29] developed a method called DataPro for wrapper repair in the case of small mark-up changes; it detects the most frequent patterns in the labeled strings, and these patterns are searched for in a page when the wrapper is broken. Lerman et al. [42] extended this content-centric approach for verification and reinduction in their DataProg system. The system takes a set of labeled example pages and attempts to induce content-based rules so that examples from new pages can be located. Wrappers can be verified by comparing the patterns of the data returned to the learned statistical distribution. When a significant difference is found, an operator can be notified or the wrapper repair process can be launched automatically.

For example, by observing the street addresses listed in our example, we can see that they are not completely random: each has a numeric character followed by a capital letter. DataProg tries to derive a simple rule to identify this field, such as <ALPHA><CAPS>, etc. Using this rule, it locates the examples on the new page, which are then passed to a wrapper induction algorithm (the STALKER algorithm) to re-induce the wrapper. This is similar to the approaches used by content-centric wrapper tools [12, 56].
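The pattern idea can be sketched as below; the token classes and the agreement rule are simplified stand-ins for DataProg's published algorithm.

def token_type(tok):
    if tok.isdigit():
        return "NUMBER"
    return "CAPS" if tok[0].isupper() else "ALPHA"

def common_pattern(examples):
    """Return the token-type sequence shared by all examples, if any."""
    patterns = [[token_type(t) for t in ex.split()] for ex in examples]
    return patterns[0] if all(p == patterns[0] for p in patterns) else None

streets = ["1234 Orchard Road", "3454 Siglap Road", "22 Alexandra Road"]
print(common_pattern(streets))   # ['NUMBER', 'CAPS', 'CAPS']
# The learned pattern is then used to locate candidate street addresses
# on the changed page, which seed STALKER to re-induce the wrapper.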

Strengths: The class of wrappers described by DataProg is very expressive, since they can handle missing and rearranged attributes. The approach applies machine learning techniques to learn a specific statistical distribution of the patterns for each field, as against the generic approach used by Kushmerick [35]. It assumes that the data representation is consistent, and by looking at the test set we can see that it can be used successfully for representations that have strong distinguishing features, like URLs, times, prices, phone numbers, etc.

Limitations: For many cases, like news, scientific publications or even author names, this approach does not work well, since there are no fixed content-based rules (alphanumeric, capitalized, etc.) that can be identified to separate them from other content on the page. For example, in the case illustrated in Figures 2.3(a) and (b), this method will not detect any change, because the generic features and data patterns of 'List Price' and 'Our Price' are the same. It can also produce too many candidate data fields [45], many of which may be noise. It fails on very long descriptions and is very sensitive to improper data coverage. Lerman et al. [42] quote a striking example of the data coverage problem that occurred for the stock quotes source: on the day the training data was collected there were many more down movements in the stock price than up, and the opposite was true on the day the test data was collected. As a result, the price change fields for those two days were dissimilar. Finally, the process of clustering the candidates for each data field does not consider the relationship of all data fields (schema) [45].


<!ELEMENT Addresses (Address+)>
<!ELEMENT Address (Name, Street_Name)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Street_Name (#PCDATA)>

Figure 2.5: User defined schema for the Address Book Example

2.2.3 SG-WRAM

SG-WRAM (Schema Guided Wrapper Maintenance) [45] is a recent method that utilizes data features, such as syntactic features and annotations, for reinduction. The approach is based on the assumption that some features of the desired information in the previous document remain the same, e.g., that syntactic features (data types), hyperlink features (whether or not a hyperlink is present) and annotation features (any string that occurs before the data field) will be retained. It also assumes that the underlying schemas are still preserved in the changed HTML document. These features help the system identify the locations of the content in the modified pages through tag structure analysis. For our example, the user-defined schema would look like Figure 2.5.

Internally, the system computes a mapping of each of the fields above to the HTML tree and generates the extraction rule. For each #PCDATA string, the features are highlighted. If the name were hyperlinked to another page, then the Hyperlink feature would be TRUE. Similarly, if each street name were preceded by the string 'Street', the Annotation would be 'Street'. For our case, the features are highlighted in Figure 2.6.


Attribute     Syntactic                 Hyperlink   Annotation
Street Name   [0-9]{0,}[A-Z][a-z]{0,}   False       NULL

Figure 2.6: Content Features of the Address field

For simple changes in pages, this method depends on the syntactic and annotation features; but in case the web site has undergone a structural change, it uses the schema to locate structural groups and uses them to extract data.
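A sketch of how the three feature kinds could be computed for one field; the helper and page are hypothetical, and the syntactic regex mirrors the Street Name row of Figure 2.6.

import re

def content_features(value, html):
    before = html[:html.find(value)]
    tags = re.findall(r"<[^>]+>", before)
    words = re.findall(r"\w+", re.sub(r"<[^>]*>", " ", before))
    return {
        # Syntactic: does the value fit the street-name data type?
        "syntactic": bool(re.fullmatch(r"[0-9]+( [A-Z][a-z]*)+", value)),
        # Hyperlink: is the value directly inside an <A> anchor?
        "hyperlink": bool(tags) and tags[-1].lower().startswith("<a"),
        # Annotation: the word immediately preceding the value (a real
        # system keeps it only if it is constant across all tuples).
        "annotation": words[-1] if words else None,
    }

page = "Street: <I>1234 Orchard Road</I><BR>"
print(content_features("1234 Orchard Road", page))
# {'syntactic': True, 'hyperlink': False, 'annotation': 'Street'}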

Strengths: Since it relies on multiple features, this method works better in many cases. In the case of the example illustrated in Figures 2.3(a) and (b), where syntactic differences are not strong, this work considers the annotation features (List Price, Our Price). Thus, when applying the extraction rule, the approach will find that the annotations have changed and conclude that the page has changed.

Limitations: The basis of this approach is the assumption that data of the same topic will always be put together, similarly to the user-defined schema, and will be retained even when the page changes. However, if the data schema or the syntactic and tag structure changes, then this method is not effective.

2.3 Summary

From our study, we observe a few key things about wrapper generation, verification and maintenance. We observe that landmark-based wrapper generation approaches are more suitable for HTML pages than content-based approaches. Conventional wrapper induction algorithms cannot be extended into reinduction algorithms, since most of them need manual labeling of data. The reinduction procedure should be automatic, for continuous extraction of information from the web source.

Wrapper verification can be handled by heuristics. It can be tackled using global statistics [35] or local, attribute-specific statistics [42]. Since page structures are very complex, if a page changes its layout completely, it is very unlikely that any of the previous features will be interchanged with others; cases like that of pages P_A and P_B may be rare. Hence it may be a common occurrence that the wrapper returns null values when a web site revamp happens. Wrapper verification can be treated independently of wrapper induction; hence, existing methods are usually adequate for the purpose.

In contrast, wrapper reinduction is a far more difficult problem and has much scope for investigation. From our survey of related work, we observe that the main issues with the current approaches are:

(i) They are potentially inefficient, either because of the need for detailed HTML grammar analysis or due to searching in big wrapper spaces, which makes them inherently slow.

(ii) They require that most of the data in the modified pages have effective features (syntactic patterns, annotations, etc.). These can be page-specific, and hence make the reinduction approach difficult.

(iii) They need many training examples for learning and reinduction. This additionally includes cases in which the user has to specify a detailed schema, which is not very user-friendly.

Our goal in this thesis is to address these issues effectively. We investigate wrapper reinduction algorithms that are efficient, learn from a small number of examples and do not require strong assumptions on the data features. In the next chapter, we describe our approach and present our algorithms.

Chapter 3

ReInduce: Our Wrapper Reinduction Algorithm

3.1 Motivation

... time interval. This content can be used to learn the new wrapper: if some of the old tuples can be detected in the page with the modified layout, we can apply wrapper induction to learn the new layout. This is the idea behind our reinduction algorithm.

To motivate our approach, let us consider a wrapper X for this website, which grabs the headlines from the page. Wrapper X extracts all the headlines present on the page and stores them in a small repository. If one day a website revamp happens and the layout is completely changed, then X might not retrieve the headlines on the page. The maintenance system then tries to locate the headlines stored in the repository and learn a new wrapper from these examples. Once the new wrapper is created, it can be used to locate all the news headlines on the same page.

Instead of trying to search the wrapper space for a wrapper that will work, or manually constructing the training examples needed for reinduction, we try to learn the new wrapper from the few examples available to us, so that when these examples, though few, are discovered on the new page, we can induce the wrapper and deploy it into the system transparently to the user. An illustration of the process flow in our wrapper reinduction system, ReInduce, is given in Figure 3.2.

The key here is that, at the induction/reinduction step, there might not be many training examples available in such a system. The key problem to be solved is learning from a small number of examples, especially when not all the examples on the page are available to us. In the following sections, we address this learning problem effectively. In this chapter, we propose our reinduction algorithm.
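The process flow just described can be sketched as follows; induce() stands for the incremental induction algorithm developed later in this chapter, and all names are illustrative.

def reinduce_cycle(wrapper, page, repository, induce):
    tuples = wrapper(page)
    if tuples:                  # wrapper still valid: refresh the repository
        repository.update(tuples)
        return wrapper, tuples
    # Layout change detected: relocate remembered tuples on the new page ...
    examples = [t for t in repository if all(v in page for v in t)]
    # ... and learn a new wrapper from the (possibly few) recovered examples.
    new_wrapper = induce(page, examples)
    return new_wrapper, new_wrapper(page)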

In the next section, we describe the formal framework for the description of the wrapper classes.


(a) Content of Page at 1200 hrs

(b) Content of Page at 1400 hrs

Figure 3.1: Incremental Content changes in the Channel NewsAsia Website


Figure 3.2: ReInduce: Wrapper Reinduction System


3.2 Formalism

Resources, queries and responses: Consider the model where a site, when queried with a URL, returns an HTML page. An information resource can be described formally as a function from a query Q to a response P [33].

Attributes and Tuples: We assume a model similar to the relational data model. Associated with every information resource is a set of K distinct attributes, each representing a column. For example, page P_A in the country name example in Figure 1.3 has K = 2.

A tuple is a vector ⟨A_1, ..., A_K⟩ of K strings. The string A_k is the value of the k-th attribute. This is similar to rows in a relational model. There are M such tuples/vectors present on a page.

If there is more than one tuple present on the page, then the k-th attribute of the m-th tuple is represented as A_{m,k}.

Content and Labels: The content of a page is the set of tuples it contains. A page's label is a representation of its content. For example, the label for page P_A in the country name example in Figure 1.3 is L_A = {(Jack, China), (John, USA), (Joseph, UK)}.

Wrappers: A wrapper takes as input a page P and outputs a label L. For wrapper w and page P, we write w(P) = L to indicate that w returns label L when invoked on P, e.g., PCWrapper(P_A) = L_A. Hence, a wrapper can be described as a function from a query response to a label.

A wrapper class can be defined as a template for generating these wrappers. All wrappers belonging to a class have similar execution steps.

Wrapper Induction: Let W be a wrapper class and E = {⟨P_1, L_1⟩, ..., ⟨P_N, L_N⟩} be a set of example pages and their labels. Wrapper induction is the problem of finding w ∈ W such that w(P_n) = L_n for all n = 1, ..., N.

Wrapper Verification: We say w is correct for P iff P's label is identical to w(P).

Wrapper Reinduction: For a dynamic web site, the response will also be a function of time. We assume the same model, in which the site is queried with a URL q and observed at time instants t_0, t_1, ..., t_N. Let:

• {P(t_0), P(t_1), ..., P(t_N)} be the pages in response to the queries, and

• {L(t_0), L(t_1), ..., L(t_N)} be the labels of the above pages.

The wrapper reinduction problem is: given the example ⟨P(t_0), L(t_0)⟩ at time t_0, find wrappers w_i ∈ W such that w_i(P(t_i)) = L(t_i).
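The definitions above can be restated as executable checks (a sketch: a wrapper is modeled as any callable from a page string to a label, i.e., a list of tuples).

from typing import Callable, List, Tuple

Label = List[Tuple[str, ...]]
Wrapper = Callable[[str], Label]

def is_correct(w: Wrapper, page: str, label: Label) -> bool:
    """Wrapper verification: w is correct for page iff w(page) equals its label."""
    return w(page) == label

def induces(w: Wrapper, examples: List[Tuple[str, Label]]) -> bool:
    """Wrapper induction succeeded iff w(P_n) = L_n for every example."""
    return all(w(p) == l for p, l in examples)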

3.3 Generic Wrapper Reinduction Algorithm

Note that the wrapper reinduction problem is trivial if both the pages and the labels remain static, i.e., P(t_i) = P(t_0) and L(t_i) = L(t_0) for i ≥ 1. Even if only the labels remain static, the problem is much simpler and reduces to inducing a wrapper w_i at time t_i using ⟨P(t_i), L(t_0)⟩ as the example.

However, when both the pages and the labels vary, we cannot induce a wrapper automatically, since L(t_i) is not known for i ≥ 1. Note that this problem is, in general, not solvable without making assumptions about the variations. Lerman et al. [42] assume that the labels follow an implicit structure over all time instants. In SG-WRAM [45], the data schema is assumed to be preserved.
