Thesis: Research and apply evolutionary computation techniques on automatic text summarization


UNIVERSITY OF ENGINEERING AND TECHNOLOGY

DO THUY DUONG

RESEARCH AND APPLY EVOLUTIONARY

COMPUTATION TECHNIQUES ON AUTOMATIC TEXT SUMMARIZATION

Field: Information technology Major: Software Engineering

Code: 60480103

MASTER THESIS IN INFORMATION TECHNOLOGY

SUPERVISOR: Assoc Prof Nguyen Xuan Hoai

HANOI - 2015


I, Do Thuy Duong, declare that this thesis ‘Research and apply evolutionary computation techniques on automatic text summarization’ and the work presented in it are my own.

I have acknowledged all main sources of help.

Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:

Date


Acknowledgements

I am heartily thankful to my supervisor, Prof. Nguyen Xuan Hoai, whose encouragement, guidance and support from the initial to the final level have enabled me to develop an understanding of the topic.

I would like to show my gratitude to the teachers in the University of Engineering and Technology, Vietnam National University, Hanoi, for helping me to gain a large body of knowledge during my two years of studying.

Lastly, I offer my regards and blessings to my friends and my family, who have always encouraged me so that I could finish this challenging research.

Automatic text summarization using differential evolution algorithm

3.1 Automatic text summarization using differential evolution (DE)

3.1.1 Document collection representation

3.1.2 Objective/fitness function

3.1.3 Main steps of differential evolution

3.1.4 Experiment, result and discussion

A summarizer highlights all sentences included in an extractive

General scheme of evolutionary algorithms

Correlation between number of generations and best fitness in population

Steps of differential evolution algorithm

Steps to get the next X1 (generation 1)

Illustration of mutation operation

Changes in summary length in [DE] method on DUC2004

Changes in summary length in [DE] method on DUC2007

Summary length in [MultiDE] method on DUC2004

Summary length in [MultiDE] method on DUC2007

Comparison between F-values of [DE] and [MultiDE] on DUC2004

Table 2.1 The basic evolutionary computation linking natural evolution to

Table 2.2 Fitness of six individuals at generation 0

Table 2.5 Values of X1 in generation 1

Table 3.1 Description of the datasets used in the experiment

Table 3.2 Parameter settings of the first experiment

Table 3.3 Summary lengths of some document collections in DUC2004 using

Table 3.9 F-Values of three evaluation measures of method [MultiDE] on

Introduction

Automatic text summarization means detecting important and condensed contents in one or more documents. This is a very challenging problem, relating to many scientific areas such as artificial intelligence, statistics, linguistics, etc. Much research has been conducted worldwide since 1950 and has produced systems such as SUMMARIST, SweSUM, MEAD, SUMMON, etc. However, this research area is still challenging and attracts more and more attention.

In this thesis, we study some evolutionary computation techniques, then apply the differential evolution algorithm to a practical problem: automatic text summarization, in particular multi-document summarization. Moreover, we also attempt to deal with the constraint on the summary length, which has not been handled effectively in these stochastic population-based methods.

1.1 Motivation

Evolutionary computation techniques use different algorithms to evolve a population of individuals over a certain number of generations. Operations such as mutation, crossover and selection are applied to the population to produce new offspring, which then compete with each other and with the previous generation to survive, based on some evaluation function. The process ends when a stopping criterion is reached and we have found the best individual, that is, the best solution to our real-world problem.

Evolutionary algorithms have been applied to solve numerous problems in various fields, one of which is automatic text summarization. However, we have found that they have a weak point in handling the summary length, unlike other sentence ranking methods. Therefore, this research attempts to improve this aspect of these algorithms.

1.2 Research Objectives

This thesis aims to study evolutionary computation techniques, especially the differential evolution algorithm, and its application to the problem of automatic text summarization. We identify the limitations of other researchers' ways of handling the summary length in this algorithm, then propose a new method to manage this length constraint that satisfies users' demands while still keeping the quality of the summary.

1.3 Thesis overview

The rest of this thesis is organized as follows. In Chapter 2, we review the background knowledge of text summarization and its classification, and introduce the main principles of evolutionary computation. In particular, the differential evolution algorithm is discussed.

Chapter 3 explains in detail the above algorithm when applied to automatic text summarization, in our case on multi-document collections. Then, an experiment is performed to test the original differential evolution algorithm. In addition, we improve the result of the previous experiment by dealing with the summary length so that the document collection is compressed quickly and effectively.

Chapter 4 recapitulates the thesis, presents our contributions and states some future research directions in this field.

Chapter 2

Background knowledge

In this chapter, text summarization is reviewed before we introduce and classify evolutionary computation. Then, an evolutionary algorithm, namely differential evolution, is discussed in detail.

2.1 Automatic text summarization

2.1.1 Definition

Automatic text summarization is the generation, by a computer program, of a shorter version of a text that still keeps the most important points of the original text

Us]

The aim of automated text summarization is to take a source text, extract the most significant content from it, and present it in a condensed form and in a way sensitive to the user's or application's needs.

A summarization system goes through several steps to generate a summary from a document or a collection of documents. First, the document is preprocessed, for example by handling punctuation and lower/upper case and by splitting the text into paragraphs, sentences and words. Then, the document is represented in a certain data type, such as vectors, each of which represents a sentence. The third step, known as the key phase, is to create the summary representation from the document representation. For instance, after this stage, some of the above vectors are chosen to be included in the summary. Finally, from the summary representation, we form the summary via the summary generation stage. Figure 2.1 represents a typical summarization system.
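The four stages described above can be sketched in a few lines of Python. This is only an illustrative outline, not the system used in this thesis: the function names (`preprocess`, `represent`, `select`, `generate`) are hypothetical, the sentence vectors are plain term-frequency counts, and the key phase is replaced by a simple stand-in that scores each sentence against the sum of all sentence vectors.

```python
import re
from collections import Counter


def preprocess(document):
    """Stage 1: split into sentences and lower-cased word tokens."""
    parts = re.split(r"(?<=[.!?])\s+", document)
    sentences = [s.strip() for s in parts if s.strip()]
    tokens = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    return sentences, tokens


def represent(tokens):
    """Stage 2: one term-frequency vector per sentence (an assumed representation)."""
    vocabulary = sorted({word for sent in tokens for word in sent})
    vectors = []
    for sent in tokens:
        counts = Counter(sent)
        vectors.append([counts[word] for word in vocabulary])
    return vectors


def select(vectors, k):
    """Stage 3 (key phase): stand-in scoring by dot product with the centroid."""
    centroid = [sum(column) for column in zip(*vectors)]
    scores = [sum(a * b for a, b in zip(v, centroid)) for v in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # keep the original sentence order


def generate(sentences, chosen):
    """Stage 4: build the summary text from the chosen sentence indices."""
    return " ".join(sentences[i] for i in chosen)


def summarize(document, k=1):
    sentences, tokens = preprocess(document)
    return generate(sentences, select(represent(tokens), k))


if __name__ == "__main__":
    print(summarize("Cats chase mice. Dogs chase cats. Birds sing.", k=2))
```

In an evolutionary approach, only the `select` stage changes: the chosen indices come from the best individual found by the search instead of from a fixed ranking.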


Figure 2.1 A typical summarization system

2.1.2 Types of text summarization

There are some ways to classify approaches to automatic text summarization as

[Figure: a sample document in which a summarizer highlights the sentences included in an extractive summary]

EVALUATION MEASURES FOR TEXT SUMMARIZATION

Josef STEINBERGER, Karel JEZEK

Department of Computer Science and Engineering

University of West Bohemia in Pilsen

Univerzitni 8

30614 Plzen, Czech Republic

email: {jstein, jezek.ka}@kiv.zcu.cz

Revised manuscript received 20 March 2007

Abstract. We explain the ideas of automatic text summarization approaches and the taxonomy of summary evaluation methods. Moreover, we propose a new evaluation measure for assessing the quality of a summary. The core of the measure is covered by Latent Semantic Analysis (LSA) which can capture the main topics of a document. The summarization systems are ranked according to the similarity of the main topics of their summaries and their reference documents. Results show a high correlation between human rankings and the LSA-based evaluation measure.

Figure 2.3 An example of the abstract summary

- Audience:

source text, paying the same attention to every aspect of the text

desires to learn about

- Usage:

• Indicative: An indicative summary only indicates the main subject matter or domain of the input text without including its contents. After reading an indicative summary, one can explain what the input text was about, but not necessarily what was contained in it.

• Informative: An informative summary covers (some of) the content, and allows one to describe (parts of) what was in the input text.

- Expansiveness:

• Background: Assumes readers do not have prior knowledge about the source text topic.

• Just-the-news: Supposes the reader's prior knowledge is up to date.

- Monolingual vs. cross-lingual: Just summarizes in the same language vs. summarizes as well as translates into another language.

- Single-document vs. multi-document source: Summarizes only one source text vs. fuses together many source texts. Figure 2.4 demonstrates a multi-document summarizer, which summarizes five documents into only one summary.

Figure 2.4 Multi-document summarization

In this thesis, we intend to generate extractive summaries for multi-document collections. Summarizing a single text is challenging enough; summarizing a document collection poses even more difficulties. We have to avoid repetitions and manage potential inconsistencies among documents, while still covering all essential information of the original texts.

2.1.3 Methodologies for automatic text summarization

Up to now, many methods have been applied to summarize text automatically, including [21]:

- Traditional methods: term, word, phrase frequencies

- Corpus-based approaches: combination of statistical features, learning to extract

- Discourse structures: WordNet, rhetorical analysis

- Knowledge-rich approaches: different for particular domains

Evolutionary computation is a new approach to summarizing text automatically, in which solutions are evolved until a certain benchmark is satisfied.

2.2 Evolutionary computation

In computer science, evolutionary computation is a subfield of artificial intelligence, defined by several types of evolutionary algorithms which are based on Darwinian principles. They belong to the family of trial-and-error problem solvers and can be regarded as global optimization methods with a meta-heuristic or stochastic optimization character, in which a population of candidate solutions is used [1].

Evolutionary computation uses the continuous progression of a population, which is then selected in a guided random search until the required stopping point is reached.

Automated problem solving that uses Darwinian principles started in the 1950s. However, three different interpretations of this idea started to be implemented in the 1960s in three strands.

Evolutionary programming (EP) was invented by Lawrence J. Fogel in the US, while John Henry Holland suggested a method named genetic algorithm (GA). Ingo Rechenberg and Hans-Paul Schwefel introduced evolution strategies (ES). Although these algorithms were proposed quite early, they have only been regarded as different types of one technology, known as evolutionary computation, since the early nineties [1].

This is a concept based on natural evolution. In nature, the plants and animals that have survived and adapted to the changing environment so far are the best ones, not eliminated by the natural selection process. Individuals in a population are parents, producing new offspring through mutation and crossover. These new children have to compete against others, including their parents, to survive into the next generation. Overall, mutation and crossover diversify the properties of offspring, while natural selection results in an increase in the quality (fitness) of the population [2]. Table 2.1 below shows the equivalent concepts between natural evolution and problem solving [3].

INITIALISE population with random candidate solutions;
EVALUATE each candidate;
REPEAT UNTIL ( TERMINATION CONDITION is satisfied ) DO
    1 SELECT parents;
    2 RECOMBINE pairs of parents;
    3 MUTATE the resulting offspring;
    4 EVALUATE new candidates;
    5 SELECT individuals for the next generation;

Figure 2.6 General scheme of evolutionary algorithms

Evolutionary computation consists of some algorithms which are used to search for optimal solutions to a problem

Figure 2.6 illustrates the transformation of a typical population until the end of a run. An evolutionary algorithm starts by initializing a number of individuals forming a population. Each individual is evaluated based on a fitness function, which varies with the type of algorithm and the specific problem. Some or all of these individuals are chosen to be parents. They undergo reproduction operators to produce new children. The fitness values of those offspring are then calculated; in other words, the quality of the offspring is assessed. The better ones among parents and children are chosen to be members of the next generation. The process is repeated until the best individual is found, based on a certain stopping criterion.
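The general scheme can be made concrete with a small Python sketch. It is a generic illustration under stated assumptions, not an algorithm taken from this thesis: individuals are real-valued vectors, parents are picked uniformly at random, recombination is uniform crossover, mutation adds Gaussian noise with a hypothetical step size `sigma`, and survivors are the best individuals among parents and children.

```python
import random


def evolve(fitness, dim=3, pop_size=6, generations=50, sigma=0.1, seed=0):
    """Maximize `fitness` with a minimal evolutionary loop (illustrative only)."""
    rng = random.Random(seed)
    # INITIALISE population with random candidate solutions
    population = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            # SELECT parents (uniformly at random in this sketch)
            mother, father = rng.sample(population, 2)
            # RECOMBINE: each component comes from one of the two parents
            child = [m if rng.random() < 0.5 else f for m, f in zip(mother, father)]
            # MUTATE: small Gaussian perturbation of every component
            child = [x + rng.gauss(0.0, sigma) for x in child]
            children.append(child)
        # EVALUATE and SELECT: keep the best pop_size among parents and children
        merged = sorted(population + children, key=fitness, reverse=True)
        population = merged[:pop_size]
    return max(population, key=fitness)


if __name__ == "__main__":
    def sphere(x):
        # Toy fitness: highest when every component equals 0.5
        return -sum((v - 0.5) ** 2 for v in x)

    best = evolve(sphere)
    print(best, sphere(best))
```

Because survivors are chosen from parents and children together, the best fitness in the population never decreases from one generation to the next.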

According to A. E. Eiben and J. Smith, the typical progression of fitness in a run is as shown in Figure 2.7 [3].

[Figure: best fitness in the population against time (number of generations)]

Figure 2.7 Correlation between number of generations and best fitness in population

The mechanisms deciding how to create children and the way to choose among parents and children vary among specific evolutionary algorithms. The following section explains in detail a technique applied to a real-world problem: automatic text summarization. In this thesis, we deal with extractive multi-document summarization.

There are several typical evolutionary algorithms, such as differential evolution, genetic algorithm, genetic programming, evolutionary programming, etc. In this research, we focus on the first one.

2.3 Differential evolution (DE)

Differential Evolution appeared when Ken Price tried to solve the Chebyshev polynomial fitting problem [14] that had been posed to him by Rainer Storn. A breakthrough came when Ken had the idea of using vector differences for perturbing the vector population. Since then, discussions between Ken and Rainer and computer simulations on both sides have brought many considerable improvements, which make DE the flexible and powerful tool it is today. The "DE community" has been growing since the early DE years of 1994 - 1996, and more and more researchers are working on and with DE. Ken and Rainer wish that DE will be improved further by scientists around the world and that DE may help more users in their daily work. This wish is the reason why DE has not been patented [8].

Figure 2.8 Steps of differential evolution algorithm

In this algorithm, a population of a certain number of individuals is initialized, each of which is a float-valued vector bounded in a specific range. These vectors (target vectors) may be binarized and are evaluated based on the fitness/objective function. The idea of this algorithm is that new generations of individuals are created from their parents by three operators:

- Mutation: based on the difference of randomly sampled pairs of individuals; it defines the searching mechanism.

- Crossover: exchanges elements of a pair of individuals, increasing the diversity of features of the offspring.

- Selection: chooses the better individuals between parents and their children to become members of the next generation.

The process is repeated until generation t_max is reached or a predefined fitness value is satisfied. The result of the algorithm is the best individual, corresponding to a float-valued or binary vector P of n dimensions. In the case of text summarization, n is the number of sentences in the document collection and P is binary: P[i] = 1 indicates that sentence i is chosen to be included in the summary, otherwise P[i] = 0, for i ∈ {1, 2, ..., n}. Figure 2.8 illustrates the main steps of a typical differential evolution algorithm.
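To make the role of the binary vector P concrete, the following Python sketch maps a float-valued individual to P and then to the selected sentences. The text above does not state which binarization rule is used, so the fixed threshold of 0.5 is an assumption made only for illustration, and the names `binarize` and `decode_summary` are hypothetical.

```python
def binarize(vector, threshold=0.5):
    """Map a float-valued individual to a binary vector P.

    Assumption: a component at or above the threshold means "selected".
    """
    return [1 if x >= threshold else 0 for x in vector]


def decode_summary(p, sentences):
    """Return the sentences whose position in P holds the value 1."""
    if len(p) != len(sentences):
        raise ValueError("P needs one entry per sentence")
    return [sentence for bit, sentence in zip(p, sentences) if bit == 1]


if __name__ == "__main__":
    sentences = ["First sentence.", "Second sentence.", "Third sentence."]
    p = binarize([0.68, 0.89, 0.04])
    print(p, decode_summary(p, sentences))
```

With n sentences in the collection, every individual has n components, so the summary length is simply the number of ones in P.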

Pseudo-code of this algorithm is given below [15]:

Begin
    Generate randomly an initial population of solutions
    Calculate the fitness of the initial population
    Do
        For each parent
            Select three different solutions at random
            Create one offspring using DE operators (mutation,

Generate randomly an initial population of solutions:

Each individual or solution contains values for x1, x2 and x3. Thus, it is a three-dimensional vector of the form $X_p = [x_{p1}, x_{p2}, x_{p3}]$. We initialize P individuals bounded in the interval $[x_{min}, x_{max}]$. In this case, we choose $x_{min} = 0$, $x_{max} = 1$ and $P = 6$.

Overall, we randomly generate six three-dimensional vectors X1, X2, X3, X4, X5 and X6 such that the elements of these vectors are bounded in (0, 1).

Calculate the fitness of the initial population (generation 0):

Table 2.2 Fitness of six individuals at generation 0

For each parent:

Now these six individuals are going to produce their own children.

First, choose individual 1 as the first target vector (the first parent).

Select three different solutions at random:

We randomly select three different individuals, for example individuals 2, 4 and 6.

Create one offspring using DE operators:

Mutation: the mutant vector is $V_1 = [v_{11}, v_{12}, v_{13}]$, where

$$v_{1i} = x_{6i} + F \cdot (x_{2i} - x_{4i}), \quad i \in \{1, 2, 3\}$$

and $F$ is the mutant factor.

Table 2.3 Creation of mutant vector V1

The result of the mutation operator is a mutant vector. We say that the mutant vector corresponding to target vector X1 = [0.68, 0.89, 0.04] is V1 = [1.58, 1.29, 0.35].
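The mutation step can be sketched as follows. The sketch assumes the common DE form "base vector plus F times a difference vector", with individual 6 as the base and individuals 2 and 4 forming the difference, which is the reading suggested by the fragment F*(X2-X4) in Figure 2.9. The numeric values of X2, X4 and X6 and the value of F are hypothetical, since the contents of Table 2.3 are not available here.

```python
def mutate(base, first, second, factor):
    """Return base + factor * (first - second), element by element."""
    return [b + factor * (x - y) for b, x, y in zip(base, first, second)]


if __name__ == "__main__":
    # Hypothetical individuals: these are NOT the values of Table 2.3.
    x2 = [0.75, 0.5, 0.25]
    x4 = [0.25, 0.25, 0.125]
    x6 = [0.5, 0.5, 0.5]
    v1 = mutate(x6, x2, x4, factor=0.5)
    print(v1)
```

Nothing in this operator keeps the result inside the initial bounds, which matches the example above, where V1 has components greater than 1.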

Crossover: The mutant vector V1 undergoes crossover with the target vector X1 to create the trial vector Z1, as shown in Table 2.4.

Table 2.4 Creation of trial vector Z1
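One common way to build a trial vector is binomial crossover, sketched below. The exact rule behind Table 2.4 is not recoverable from the text, so this variant, the crossover rate `cr` and the function name `crossover` are assumptions used only to illustrate how elements of X1 and V1 can be mixed.

```python
import random


def crossover(target, mutant, cr, rng):
    """Binomial crossover: take each component from the mutant with probability cr.

    One randomly chosen position is always taken from the mutant, so the
    trial vector never ends up as a plain copy of the target.
    """
    n = len(target)
    forced = rng.randrange(n)
    trial = []
    for i in range(n):
        if i == forced or rng.random() < cr:
            trial.append(mutant[i])
        else:
            trial.append(target[i])
    return trial


if __name__ == "__main__":
    x1 = [0.68, 0.89, 0.04]  # target vector from the example
    v1 = [1.58, 1.29, 0.35]  # mutant vector from the example
    print(crossover(x1, v1, cr=0.5, rng=random.Random(0)))
```

A rate close to 1 makes the trial vector resemble the mutant, while a rate close to 0 keeps it near the target, changing only the forced position.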

If offspring is the same or better than its parent (selection), parents are replaced:

If f(X1) ≤ f(Z1), then the trial vector Z1 becomes target vector X1 of the next generation (generation 1), as shown in Table 2.5.

Table 2.5 Values of X1 in generation 1

The process continues with target vector X2 of generation 0. The process ends when generation t_max (user-defined) is reached. We take the maximum f(X) as the maximum fitness value and the corresponding X as the final solution. Figure 2.9 specifies the steps to get the value of X1 at generation 1.

Figure 2.9 Steps to get the next X1 (generation 1)
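The whole walk-through (initialization, mutation, crossover, selection, repetition until t_max) can be tied together in one short Python sketch. It is a minimal illustration under stated assumptions rather than the implementation used in this thesis: binomial crossover with a hypothetical rate `cr`, a fixed mutant factor `factor`, a toy fitness function to maximize, and the three donor individuals always chosen different from the target.

```python
import random


def differential_evolution(fitness, dim=3, pop_size=6, t_max=100,
                           factor=0.5, cr=0.9, seed=0):
    """Maximize `fitness`; needs pop_size >= 4 to pick three distinct donors."""
    rng = random.Random(seed)
    # Generation 0: random vectors with components in [0, 1)
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    scores = [fitness(x) for x in pop]
    for _ in range(t_max):
        next_pop, next_scores = [], []
        for p in range(pop_size):
            # Select three different solutions, none of them the target
            a, b, c = rng.sample([i for i in range(pop_size) if i != p], 3)
            # Mutation: base vector plus scaled difference vector
            mutant = [pop[a][i] + factor * (pop[b][i] - pop[c][i])
                      for i in range(dim)]
            # Crossover: mix mutant and target into a trial vector
            forced = rng.randrange(dim)
            trial = [mutant[i] if (i == forced or rng.random() < cr) else pop[p][i]
                     for i in range(dim)]
            # Selection: the trial replaces the target if it is the same or better
            trial_score = fitness(trial)
            if trial_score >= scores[p]:
                next_pop.append(trial)
                next_scores.append(trial_score)
            else:
                next_pop.append(pop[p])
                next_scores.append(scores[p])
        pop, scores = next_pop, next_scores
    best = max(range(pop_size), key=lambda i: scores[i])
    return pop[best], scores[best]


if __name__ == "__main__":
    def toy_fitness(x):
        # Hypothetical objective: highest when every component equals 0.3
        return -sum((v - 0.3) ** 2 for v in x)

    solution, value = differential_evolution(toy_fitness)
    print(solution, value)
```

Because each target is replaced only by a trial that is the same or better, the best fitness in the population cannot decrease between generations.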

On the whole, the properties that we need to care about in DE are:

- Solution representation: a real-valued or binary vector.

- Number of individuals in the population when initialized: The population is usually initialized randomly, bounded in an interval. The population size is the number of individuals in the population in a generation. This is an important parameter we need to decide. If the population size is too small, the algorithm converges too fast and individuals can reach only a small part of the search space. On the other hand, if the population size is too big, resources are wasted and the searching process is extended.

- Objective/fitness function: This function evaluates how good the solution is; therefore it needs to be built carefully.

- Operators [22]:

o Mutation: The goal is to define the searching mechanism of the algorithm, generating mutant vectors. The mutant factor (F) decides how much perturbation the solution can obtain. If the mutant factor is large, the size of the jump will be increased. That is, the population can break away from a local optimum effectively.

Title	Research and Apply Evolutionary Computation Techniques on Automatic Text Summarization
Author	Do Thuy Duong
Supervisor	Assoc. Prof. Nguyen Xuan Hoai
University	Vietnam National University, Hanoi, University of Engineering and Technology
Major	Information Technology
Type	Thesis
Year of publication	2015
City	Hanoi