Lecture Notes in Computer Science- P15 doc


To solve the above problems, a new method is needed. It should be able to distinguish content pages from non-content pages, and then extract the main contents from content pages without using templates or a DOM-Tree.

In this paper, we propose a novel method for extracting main contents. The main contributions include:

(1) We define a new concept of block and propose a block-partition method for web pages. Without using a DOM-Tree or templates, main contents and noise can be partitioned into different blocks.

(2) We define the concept of Block Distribution and study its features. Based on these features, we employ a classification method to distinguish content pages from non-content pages, and then employ outlier analysis to obtain the main contents from the Block Distribution.

The remainder of this paper is organized as follows. Section 2 gives a brief introduction to related work. Section 3 presents the block-partition method for web pages. Section 4 introduces the concept of block distribution and its statistical features. Section 5 gives a thorough study of the performance of the new method. Section 6 summarizes our work.

2 Related Works

Some works [3, 4, 5] have studied template-based methods for extracting contents from web pages. Li [3] proposes a hybrid method that employs both tag-sequence matching and tree matching to extract news from news web pages. Geng [4] first generates mapping rules from specified news pages, and then employs these rules to extract information from web pages that have the same page structure. Yi [5] assumes that the layout of web pages is fixed within the same website and builds a Style Tree for the website. The contents of the website's pages can then be extracted by using the Style Tree.

Lin [6] partitions a web page into blocks and then builds a profile vector for each block. From the entropy value of each feature in a content block, the entropy of the block is derived. By this entropy, blocks are determined to be either informative or redundant. Cai [8] utilizes visual cues of web pages, such as layout, font size, and color, to extract information. Wang [9] proposes the STU-DOM tree, a data structure that may be regarded as a DOM tree with some semantic contextual attributes. After pruning, the STU-DOM tree can be used to automatically and accurately extract the useful and relevant contents from an HTML document. [10] uses a heuristic method to partition web pages and then calculates the probability of each individual block. Contents are extracted from blocks with high probability.

There are three kinds of methods for partitioning web pages: DOM-based [11], location-based [12], and visual-based [8]. [6] uses the <TABLE> tag as the basic granularity for partitioning a web page. [7] proposes an efficient fragment-aware data structure to model dynamic web pages and detect fragments that are shared among documents.

3 Block Partition of Web Page

When extracting main contents from web pages, some studies first convert the web page to a DOM-Tree and then obtain the contents by traversing the DOM-Tree. In contrast to these methods, we partition the web page directly into blocks and store them in a list structure without using a DOM-Tree. Because it avoids complicated operations on the DOM-Tree, such a method may achieve better time performance in extracting main contents than DOM-based methods.

In many studies [6, 10, 12], a block is defined as the portion of a web page between an open-tag and its corresponding close-tag. Such blocks may contain much noise besides the main contents, or the main contents may be scattered across multiple blocks.

We try to put the main contents into one block without noise, and therefore give a new definition of block.

Definition 1 (block and sub-block): Let $S$ be a sequence of characters that represents a piece of an HTML document. For a pair of tags <TAG> and </TAG> in $S$, $s$ = (<TAG>, …, </TAG>), $s \subset S$, is the sub-sequence of $S$ starting at <TAG> and ending at </TAG>. For any such sub-sequence $s_i \subset S$, if there exists a sub-sequence $s_j \subset s_i$, then $(s_i - s_j)$ is called a Block; otherwise $s_i$ itself is called a Block. A Block is denoted $B$, and $s_j$ is called a sub-block of $s_i$.

A Block $B$ consists of a tag pair <TAG>, </TAG> and the contents $c$ between the two tags.

Let $s_i$ and $s_j$ be two sub-sequences of $S$ corresponding to blocks $b_i$ and $b_j$ respectively. If $s_i \cap s_j \neq \emptyset$, we say that $b_i \cap b_j \neq \emptyset$.

Definition 2 (Block-List): Let BSet be the collection of blocks of a partitioned web page, and let Block-List be a list storing BSet. Each node in Block-List corresponds to a block b ∈ BSet and consists of two fields, t and c, where t records the open-tag of block b and c records the content of b.

Figure 1 gives an example of a Block-List. By analyzing the structure of web pages, we obtain Observation 1.

Observation 1: (1) Most tags in HTML documents occur in pairs. A pair of tags consists of an open-tag <TAG> and a close-tag </TAG>, and the contents of the web page appear between tags. Tag pairs may be nested, for example <table><tr></tr></table>. (2) Some tags may cross each other, for example <Table><Form></Table></Form>. (3) Some tags may occur singly, for example <br>, <p>. (4) Some web pages do not strictly comply with the HTML specification: some tags fail to occur in pairs, and we call them missing tags.

After crossing tags, single tags, and missing tags are eliminated, all tags in an HTML document occur in pairs and are properly nested. Such an HTML document is called a normalized HTML document, and it can be obtained with preprocessing techniques. This paper assumes that all HTML documents involved in our work are normalized.

Through tests, we have observed that the following techniques partition web pages well: (1) keeping tags that define the structure of the web page, for example <TABLE>, <TR>, <TD>, <DIV>; (2) neglecting tags that only mark up text, for example <FONT>, <SPAN>; (3) skipping tag pairs that are unrelated to the contents of the web page, for example <STYLE>, <SCRIPT>; (4) regarding <A> as a structure tag.

To partition a web page into blocks as defined in this paper, a new method is needed.

We use a stack to aid the block partition of a web page. While scanning the web page, once an open-tag is met, a block is built and inserted into Block-List; at the same time, the open-tag and a reference to the block are pushed onto the stack. When a close-tag is met, the top tag on the stack is popped. Whatever tag is met, the contents between that tag and the preceding tag are first extracted and inserted into the block corresponding to the top element of the stack, before the stack is updated for the current tag.

This method is simpler than the DOM-based method. Algorithm 1 describes the process of block partition.

Algorithm 1 Block partition of a web page

Input: HTML document f
Output: Block-List BL

s ← build_aid_stack(); BL ← build_Block_list();
while (NOT EOF of f) {
    tag ← getNextTag();
    content ← getContent();          // get contents between current tag and former tag
    block ← getTop();                // get block corresponding to top tag in stack
    insert(content, block);          // put contents into the block
    if (isNeglect(tag)) continue;    // is insignificant tag?
    if (isJump(tag))                 // is skipped tag?
        { jump(); continue; }        // skip tags and contents between them
    if (isOpenTag(tag)) {            // is open tag?
        block ← new Block(tag);
        insert(BL, block);           // insert new block into Block-List
        push(s, tag, block);         // put tag and reference of block onto stack
    } else                           // is close tag?
        pop(s);
} /* end of while */
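The stack-based scan of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' Java implementation: the names `Block`, `partition_blocks`, `NEGLECT_TAGS`, and `SKIP_TAGS` are hypothetical, tags are found with a simple regular expression, and the input is assumed to be a normalized HTML document (all remaining tags paired and properly nested).

```python
import re
from dataclasses import dataclass

# Assumed tag classes, taken from the examples given in the text.
NEGLECT_TAGS = {"font", "span"}    # ignored; their text flows into the enclosing block
SKIP_TAGS = {"style", "script"}    # the tag pair and everything inside it is dropped

TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


@dataclass
class Block:
    tag: str            # field t: open-tag of the block
    content: str = ""   # field c: content of the block


def partition_blocks(html):
    """Build a Block-List from a normalized HTML string (illustrative sketch)."""
    block_list = []   # the Block-List BL
    stack = []        # auxiliary stack holding references to open blocks
    pos = 0
    while True:
        match = TAG_RE.search(html, pos)
        if match is None:
            break
        # Text between the previous tag and this one goes to the top block.
        text = html[pos:match.start()].strip()
        if text and stack:
            top = stack[-1]
            top.content = (top.content + " " + text).strip()
        pos = match.end()
        is_close = match.group(1) == "/"
        name = match.group(2).lower()
        if name in NEGLECT_TAGS:
            continue
        if name in SKIP_TAGS:
            if not is_close:
                # Jump past the matching close tag, discarding what lies between.
                end = re.compile(r"</" + name + r"\s*>", re.I).search(html, pos)
                pos = end.end() if end else len(html)
            continue
        if not is_close:
            block = Block(name)
            block_list.append(block)
            stack.append(block)
        elif stack:
            stack.pop()
    return block_list
```

Applied to the fragment `<DIV> Hello <DIV> GBK </DIV> World </DIV>`, the sketch yields an outer `div` block with content `Hello World` and an inner `div` block with content `GBK`, which matches Definition 1: the outer block is the outer sub-sequence minus its sub-block.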

Lemma 1: Given a piece of HTML document f, let t1 and t2 be the time costs of building the Block-List and the DOM-Tree for f respectively. Then t1 < t2.

Rationale: Let t1_T and t2_T be the time costs of scanning f when building the Block-List and the DOM-Tree respectively, and let t1_I and t2_I be the time costs of inserting contents into the Block-List and the DOM-Tree respectively, so that t1 ≈ t1_T + t1_I and t2 ≈ t2_T + t2_I. (1) When building the Block-List, some tag pairs are neglected or skipped, whereas every tag is processed when building the DOM-Tree. Thus t1_T < t2_T. (2) Inserting contents into the Block-List is a simple operation that obtains the block reference from the top element of the stack, whereas before inserting contents into the DOM-Tree the insertion position must first be located in the tree. Thus t1_I < t2_I. (3) By (1) and (2), t1 < t2.

Example 1: Using Algorithm 1 to build the Block-List for the web page shown in Fig. 1(a) yields the Block-List shown in Fig. 1(b).

<DIV> Hello
<DIV> GBK </DIV> World
</DIV>

Fig. 1(a) A portion of an HTML document

Fig. 1(b) Block-List. Nodes recoverable from the figure: DIV (Hello World), TABLE, TR, TR, TD (USA), TD (CHN), DIV (GBK)

4 Block Distribution and Main Contents Extraction

In practical applications, before extracting contents from web pages, the first step is to determine whether a web page contains main contents. Web pages may be divided into two types: content pages and non-content pages. A non-content page carries various information but no main contents, whereas a content page contains main contents. Fig. 2 gives an example of a non-content page and a content page.

To distinguish content pages from non-content pages, we need to study the features of web pages and then use these features to classify them.

Fig. 2(a) Non-content page. Fig. 2(b) Content page (the main content region is marked).

Fig. 3 Curves of block distribution (x-axis: index of blocks)

Definition 3 (Block Distribution and Block Distribution Curve): Given a Block-List BL, let o be a node in BL, c be the content of o, and n be the size of c. The collection $\{n_1, \ldots, n_k\}$ represents the sizes of all blocks in BL. After the collection is sorted in descending order, we call the sequence $D = (n_1, \ldots, n_k)$ the block distribution of the web page. Let the y-axis represent $n_i$ and the x-axis represent the index of $n_i$ in $D$. Then $D$ is drawn as a piecewise-linear curve, called the Block Distribution Curve.
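As a small illustration of Definition 3, the block distribution can be computed from the content fields of a Block-List. The sketch below assumes that the size of a block is the number of characters in its content (the text does not fix the unit), and the function name `block_distribution` is hypothetical.

```python
def block_distribution(contents):
    """Return D = (n_1, ..., n_k): block sizes sorted in descending order.

    `contents` is an iterable of block content strings; size is taken to be
    the character count, which is an assumption of this sketch.
    """
    return sorted((len(c) for c in contents), reverse=True)


# Contents of three blocks: two from Fig. 1(a) plus one empty block.
distribution = block_distribution(["GBK", "Hello World", ""])
```

Plotting `distribution` against its 1-based index gives the Block Distribution Curve.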

Using Algorithm 1, we can derive the Block Distributions of web pages. Fig. 3 shows examples of Block Distribution Curves. Algorithm 1 tends to put the main contents into one block and scatter noise across multiple blocks. If the content block is large enough, the Block Distributions of content pages and non-content pages differ markedly. For example, in Fig. 3 the Block Distribution Curve of a content page, Curve 1, is steeper than the Block Distribution Curve of a non-content page, Curve 5.

Lemma 2: Let $D_1 = (n_1 + k, n_2, \ldots, n_m)$ and $D_2 = (n_1, n_2, \ldots, n_m)$, $k > 0$, be the block distributions of two web pages, so that $D_1$ equals $D_2$ except for the value at index 1. Then $Dev(D_1) > Dev(D_2)$.

Proof: see appendix.
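The appendix proof is not reproduced here. The following is only a sketch of why the claim holds, under stated assumptions: $Dev$ is read as the population variance, $D_2 = (n_1, \ldots, n_m)$ is sorted in descending order with $m \ge 2$, and $D_1$ is the distribution whose first value is larger by $k$.

```latex
% Assumptions: Dev = population variance, m >= 2, k > 0,
% D_2 = (n_1, ..., n_m) in descending order, D_1 = (n_1 + k, n_2, ..., n_m).
\begin{align*}
\mu_2 &= \frac{1}{m}\sum_{i=1}^{m} n_i, \qquad \mu_1 = \mu_2 + \frac{k}{m},\\
m\,\mathrm{Dev}(D_2) &= \sum_{i=1}^{m} n_i^2 - m\,\mu_2^2,\\
m\,\mathrm{Dev}(D_1) &= \sum_{i=1}^{m} n_i^2 + 2k\,n_1 + k^2 - m\Bigl(\mu_2 + \frac{k}{m}\Bigr)^2,\\
m\bigl(\mathrm{Dev}(D_1) - \mathrm{Dev}(D_2)\bigr) &= 2k\,(n_1 - \mu_2) + k^2\Bigl(1 - \frac{1}{m}\Bigr) > 0.
\end{align*}
```

The last expression is positive because $n_1$ is the largest value and therefore $n_1 \ge \mu_2$, while $k > 0$ and $m \ge 2$ make the second term strictly positive.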

Lemma 2 shows that the larger the main content block in a Block Distribution, the larger the variance of the Block Distribution. Because the Block Distribution of a non-content page contains no conspicuously large block, its variance is small. Therefore, variance may be used to distinguish content pages from non-content pages.

However, variance alone does not always give sufficiently good results. Test 2 in Section 5 demonstrates that using only variance to distinguish content pages from non-content pages does not achieve sufficient accuracy. We therefore introduce a new feature of the block distribution.

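Definition 4 (bending): Let $D = (n_1, \ldots, n_k)$ be the Block Distribution of a web page. In the Block Distribution Curve of $D$, let $\alpha_i$ ($i = 1, \ldots, k-1$) be the slope of the $i$-th segment of the curve. Then $\beta(D) = \max(\alpha_1, \ldots, \alpha_{k-1}) - \min(\alpha_1, \ldots, \alpha_{k-1})$ is called the bending of $D$.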

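If two blocks in a Block Distribution have the same size, the bending equals the maximum difference between two adjacent blocks in the Block Distribution: because $D$ is sorted in descending order, every slope is non-positive, so the maximum slope is 0 and the bending is the magnitude of the steepest drop. For example, the bending of the Block Distribution $D_1 = (5, 2, 2, 2, 2)$ is $\beta(D_1) = 3$.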
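The bending can be computed directly from a block distribution. The sketch below assumes unit spacing on the x-axis, so that the slope of each segment is simply the difference between adjacent values; the function name `bending` and the choice to return 0 for distributions with fewer than two blocks are assumptions of this illustration.

```python
def bending(distribution):
    """Bending of a block distribution: max segment slope minus min segment slope.

    Assumes unit spacing between indices, so slope_i = n_{i+1} - n_i.
    """
    if len(distribution) < 2:
        return 0  # no segments; returning 0 here is an assumption
    slopes = [b - a for a, b in zip(distribution, distribution[1:])]
    return max(slopes) - min(slopes)


# The example from the text: the slopes are (-3, 0, 0, 0), so the bending is 3.
example_bending = bending([5, 2, 2, 2, 2])
```

Together with the variance of the distribution, this value forms the two-feature description of a page that a classifier can consume.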

After the variance and bending of each Block Distribution are derived, a classification algorithm may be employed to distinguish content pages from non-content pages. Test 2 shows that classification methods distinguish content pages from non-content pages well on the basis of these two features.

In content pages, main content blocks are large and sparse. Relative to noise blocks, it is therefore suitable to treat content blocks as outliers. In our application, we employ a deviation-based outlier detection algorithm [13] to identify content blocks. The contents of the content blocks are the main contents of the web page. Experiments demonstrate the feasibility of our method.
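The exact algorithm of [13] is not described in this text, so the following is only a simplified sketch in the spirit of deviation-based outlier detection, not the authors' procedure. It assumes that candidate outlier sets are prefixes of the sorted block distribution and scores each candidate by a smoothing factor: the number of remaining blocks multiplied by the reduction in variance obtained by removing the candidate. The function name `content_block_sizes` is hypothetical.

```python
from statistics import pvariance


def content_block_sizes(sizes):
    """Return the sizes of blocks treated as outliers (candidate content blocks).

    Simplified deviation-based rule: among the prefixes of the descending
    distribution, pick the one whose removal gives the largest smoothing
    factor = (number of remaining blocks) * (reduction in variance).
    """
    distribution = sorted(sizes, reverse=True)
    if len(distribution) < 2:
        return []
    total_variance = pvariance(distribution)
    best_count, best_factor = 0, 0.0
    # Keep at least one block in the remainder so its variance is defined.
    for count in range(1, len(distribution)):
        rest = distribution[count:]
        factor = len(rest) * (total_variance - pvariance(rest))
        if factor > best_factor:
            best_count, best_factor = count, factor
    return distribution[:best_count]
```

For the distribution (5, 2, 2, 2, 2) the rule selects the single block of size 5, and for a flat distribution it selects nothing, which is the behaviour one would expect for a non-content page.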

5 Experiments and Results

In this section, we perform a thorough analysis of our method. All experiments were implemented in Java and conducted on a 2.6 GHz Intel system with 512 MB of RAM.

5.1 Dataset

Experiments are conducted on three data sets. Dataset1 consists of 543 web pages (220 content pages (news pages) and 323 non-content pages) collected from the websites SOHU, YAHOO, CHINA, and Netease. Dataset2 comes from Chinese Web

Title	Web Contents Extracting for Web-Based Learning
Author	J. Qiu, et al.
Subject	Web-Based Learning