UNIVERSITY OF ENGINEERING AND TECHNOLOGY
DO THUY DUONG
RESEARCH AND APPLY EVOLUTIONARY
COMPUTATION TECHNIQUES ON AUTOMATIC TEXT SUMMARIZATION
Field: Information technology Major: Software Engineering
Code: 60480103
MASTER THESIS IN INFORMATION TECHNOLOGY
SUPERVISOR: Assoc Prof Nguyen Xuan Hoai
HANOI - 2015
I, Do Thuy Duong, declare that this thesis ‘Research and apply evolutionary computation techniques on automatic text summarization’ and the work presented in it are my own.
I have acknowledged all main sources of help.
Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.
Signed:
Date
Acknowledgements
I am heartily thankful to my supervisor, Prof Nguyen Xuan Hoai, whose encouragement, guidance and support from the initial to the final level have enabled me to develop an understanding of the topic.
I would like to show my gratitude to the teachers in the University of Engineering and Technology, Vietnam National University, Hanoi for helping me to gain a large body of knowledge during my two years of studying.
Lastly, I offer my regards and blessings to my friends and my family, who have always encouraged me so that I could finish this challenging research.
Automatic text summarization using differential evolution algorithm
3.1 Automatic text summarization using differential evolution (DE)
3.1.1 Document collection representation
3.1.2 Objective/fitness function
3.1.3 Main steps of differential evolution
3.1.4 Experiment, result and discussion
A summarizer highlights all sentences included in an extractive
General scheme of evolutionary algorithms
Correlation between number of generations and best fitness in population
Steps of differential evolution algorithm
Steps to get the next X1 (generation 1)
Illustration of mutation operation
Changes in summary length in [DE] method on DUC2004
Changes in summary length in [DE] method on DUC2007
Summary length in [MultiDE] method on DUC2004
Summary length in [MultiDE] method on DUC2007
Comparison between F-values of [DE] and [MultiDE] on DUC2004
Table 2.1 The basic evolutionary computation linking natural evolution to
Table 2.2 Fitness of six individuals at generation 0
Table 2.5 Values of X1 in generation 1
Table 3.1 Description of the datasets used in the experiment
Table 3.2 Parameter settings of the first experiment
Table 3.3 Summary lengths of some document collections in DUC2004 using
Table 3.9 F-values of three evaluation measures of method [MultiDE] on
Introduction
Automatic text summarization means detecting the important content in one or more documents and presenting it in condensed form. This is a very challenging problem, relating to many scientific areas such as artificial intelligence, statistics, linguistics, etc. Much research has been conducted worldwide since 1950, producing systems such as SUMMARIST, SweSUM, MEAD, SUMMON, etc. However, this research area is still challenging and attracts more and more attention.
In this thesis, we study some evolutionary computation techniques, then apply the differential evolution algorithm to a practical problem: automatic text summarization, in particular multi-document summarization. Moreover, we also attempt to deal with the constraint on the summary length, which has not been handled effectively in these stochastic population-based methods.
1.1 Motivation
Evolutionary computation techniques use different algorithms to evolve a population of individuals over a certain number of generations. Operations such as mutation, crossover and selection are applied to the population to produce new offspring, which then compete with each other and with the previous generation to survive, based on some evaluation function. The process ends when a stopping criterion is reached, and the best individual found is taken as the best solution to our real-world problem.
Evolutionary algorithms have been applied to solve numerous problems in various fields, one of which is automatic text summarization. However, we have found that, unlike sentence ranking methods, they are weak at handling the summary length. Therefore, this research attempts to improve this aspect of these algorithms.
1.2 Research Objectives
This thesis aims to study evolutionary computation techniques, especially the differential evolution algorithm, and their application to the problem of automatic text summarization. We identify the limitations of the ways other researchers handle the summary length in this algorithm, then propose a new method to manage this length constraint that satisfies users' demands while still keeping the quality of the summary.
1.3 Thesis overview
The rest of this thesis is organized as follows. In chapter 2, we review the background knowledge of text summarization and its classification, and introduce the main principles of evolutionary computation. In particular, the differential evolution algorithm is discussed.
Chapter 3 explains in detail how the above algorithm is applied to automatic text summarization, in our case on multi-document collections. Then, an experiment is performed to test the original differential evolution algorithm. In addition, we improve the result of the previous experiment by dealing with the summary length so that the document collection is compressed quickly and effectively.
Chapter 4 recapitulates the thesis, presents our contributions and states some future research directions in this field.
Chapter 2
Background knowledge
In this chapter, text summarization is reviewed before we introduce and classify evolutionary computation. Then, an evolutionary algorithm, namely differential evolution, is discussed in detail.
2.1 Automatic text summarization
2.1.1 Definition
Automatic text summarization is the generation, by a computer program, of a shorter version of a text that still keeps the most important points of the original text
Us]
The aim of automated text summarization is to take a source text, extract the most significant content from it, and present it in a condensed form and in a way sensitive to the user's or application's needs.
A summarization system goes through several steps to generate a summary from a document or a collection of documents. First of all, the document is preprocessed, for example by handling punctuation and lower/upper case, and by splitting it into paragraphs, sentences, words, etc. Then, the document is represented in a certain data type such as vectors, each of which represents a sentence. The third step, known as the key phase, is to create the summary representation from the document representation. For instance, after this stage, some of the above vectors are chosen to be included in the summary. Finally, from the summary representation, we form the summary in the summary generation stage.
Figure 2.1 represents a typical summarization system.
Figure 2.1 A typical summarization system
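The four stages described above (preprocessing, document representation, summary representation, summary generation) can be sketched as a small pipeline. This is only an illustrative sketch: the function names, the naive sentence splitter and the term-frequency scoring used in the selection stage are assumptions made for the example, not the method developed in this thesis.

```python
import re
from collections import Counter


def preprocess(document: str) -> list[str]:
    """Stage 1: split the document into sentences (naive split on . ! ?)."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def represent(sentences: list[str]) -> list[Counter]:
    """Stage 2: represent each sentence as a bag-of-words vector."""
    return [Counter(re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]


def select(vectors: list[Counter], k: int) -> list[int]:
    """Stage 3: choose k sentence vectors for the summary representation.

    Placeholder scoring (an assumption): a sentence scores the sum of the
    document-level frequencies of its distinct terms.
    """
    totals = Counter()
    for v in vectors:
        totals.update(v)
    scores = [sum(totals[t] for t in v) for v in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # restore the original sentence order


def generate(sentences: list[str], chosen: list[int]) -> str:
    """Stage 4: build the summary text from the chosen sentences."""
    return " ".join(sentences[i] for i in chosen)


def summarize(document: str, k: int = 2) -> str:
    sentences = preprocess(document)
    vectors = represent(sentences)
    return generate(sentences, select(vectors, k))
```

In an evolutionary approach, only the selection stage changes: the fixed scoring rule is replaced by a search over candidate sentence subsets.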
2.1.2 Types of text summarization
There are some ways to classify approaches to automatic text summarization as
[Figure: sample news text in which a summarizer highlights the sentences included in an extractive summary]
[Figure: first page of the paper "Evaluation Measures for Text Summarization" by Josef Steinberger and Karel Ježek, showing its abstract]
Figure 2.3 An example of the abstract summary
- Audience:
• Generic: A generic summary provides the author's point of view of the source text, paying the same attention to every aspect of the text.
• Query-oriented: A query-oriented (or user-oriented) summary favors some particular aspects of the text, depending on what the user wants to learn about.
- Usage:
• Indicative: An indicative summary only indicates the main subject matter or domain of the input text without including its contents. After reading an indicative summary, one can explain what the input text was about, but not necessarily what was contained in it.
• Informative: An informative summary covers (some of) the content, and allows one to describe (parts of) what was in the input text.
- Expansiveness:
• Background: Assumes readers do not have prior knowledge about the source text topic.
• Just-the-news: Supposes the reader's prior knowledge is up-to-date.
- Monolingual vs cross-lingual: Summarizes in the same language vs summarizes and also translates into another language.
- Single-document vs multi-document source: Summarizes only one source text vs fuses together many source texts. Figure 2.4 demonstrates a multi-document summarizer, which summarizes five documents into only one summary.
Figure 2.4 Multi-document summarization
In this thesis, we intend to generate extractive summaries for multi-document collections. Summarizing a single text is challenging enough; summarizing a document collection poses even more difficulties. We have to avoid repetitions and manage potential inconsistencies among documents, while still covering all essential information of the original texts.
2.1.3 Methodologies for automatic text summarization
Up to now, there have been many methods applied to summarize text
automatically including [21]:
- Traditional methods: term, word, phrase frequencies
- Corpus-based approaches: combination of statistical features, learning to
extract
- Discourse structures: Word-net, Rhetorical analysis
- Knowledge rich approaches: different for particular domains
Evolutionary computation is a new approach to summarize text automatically, in
which solutions are evolved until a certain benchmark is satisfied.
2.2 Evolutionary computation
In computer science, evolutionary computation is a subfield of artificial intelligence, defined by several types of evolutionary algorithms which are based on Darwinian principles. They belong to the family of trial-and-error problem solvers and can be regarded as global optimization methods with a meta-heuristic or stochastic optimization character, in which a population of candidate solutions is used [1].
Evolutionary computation uses continuous progression of the population, which is selected in a guided random search until the required stopping condition is met.
Automated problem solving that uses Darwinian principles started in the 1950s. However, three different interpretations of this idea started to be implemented in the 1960s in three strands.
Evolutionary programming (EP) was invented by Lawrence J. Fogel in the US, while John Henry Holland suggested a method named genetic algorithm (GA). Ingo Rechenberg and Hans-Paul Schwefel introduced evolution strategies (ES). Although these algorithms were proposed quite early, they have only been considered different types of one technology, known as evolutionary computation, since the early nineties [1].
This is a concept based on natural evolution. In nature, the plants and animals that have survived and adapted to the changing environment so far are the best ones, not eliminated by the natural selection process. Individuals in a population are parents, producing new offspring through mutation and crossover. These new children have to compete against others, including their parents, to survive in the next generation. Overall, mutation and crossover diversify the properties of offspring, while natural selection results in an increase in the quality (fitness) of the population [2]. Table 2.1 below shows equivalent concepts between natural evolution and problem solving [3].
INITIALISE population with random candidate solutions;
EVALUATE each candidate;
REPEAT UNTIL ( TERMINATION CONDITION is satisfied ) DO
1 SELECT parents;
2 RECOMBINE pairs of parents;
3 MUTATE the resulting offspring;
4 EVALUATE new candidates;
5 SELECT individuals for the next generation;
Figure 2.6 General scheme of evolutionary algorithms
Evolutionary computation consists of several algorithms which are used to search for optimal solutions to a problem.
Figure 2.6 illustrates the transformation of a typical population until the end of a run. An evolutionary algorithm starts by initializing a number of individuals forming a population. Each individual is evaluated based on a fitness function, which varies among types of algorithms and specific problems. Some or all of these individuals are chosen to be parents. They undergo reproduction operators to produce new children. The fitness values of those offspring are calculated; in other words, the quality of those offspring is assessed. The better ones among parents and children are chosen to be members of the next generation. The process is repeated until the best individual is found based on a certain stopping criterion.
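The general scheme can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the concrete operator choices (binary tournament parent selection, one-point crossover, Gaussian mutation of one gene, survivor selection over parents plus children) and all parameter names and defaults are assumptions made for the example, and a fixed number of generations is used as the termination condition.

```python
import random


def evolve(fitness, n_genes=5, pop_size=20, generations=50, seed=0):
    """Maximize `fitness` over vectors in [0, 1]^n_genes (n_genes >= 2)."""
    rng = random.Random(seed)
    # INITIALISE population with random candidate solutions.
    population = [[rng.random() for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):  # termination: fixed number of generations
        offspring = []
        for _ in range(pop_size):
            # SELECT parents (binary tournament, an assumed choice).
            p1 = max(rng.sample(population, 2), key=fitness)
            p2 = max(rng.sample(population, 2), key=fitness)
            # RECOMBINE pairs of parents (one-point crossover).
            cut = rng.randrange(1, n_genes)
            child = p1[:cut] + p2[cut:]
            # MUTATE the offspring: perturb one gene, clipped to [0, 1].
            j = rng.randrange(n_genes)
            child[j] = min(1.0, max(0.0, child[j] + rng.gauss(0.0, 0.1)))
            offspring.append(child)
        # EVALUATE new candidates and SELECT the best of parents + children.
        population = sorted(population + offspring, key=fitness, reverse=True)[:pop_size]
    return population[0]


best = evolve(sum)  # toy fitness: the sum of the genes
```

Because survivors are drawn from parents and children together, the best fitness in the population never decreases from one generation to the next.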
According to A.E. Eiben and J.E. Smith, the typical progression of fitness in a run is as shown in Figure 2.7 [3].
[Figure: plot of best fitness in the population against time (number of generations)]
Figure 2.7 Correlation between number of generations and best fitness in population
The mechanisms deciding how to create children and how to choose among parents and children vary among specific evolutionary algorithms. The following section explains in detail a technique applied to a real-world problem: automatic text summarization. In this thesis, we deal with extractive multi-document summarization.
There are some typical evolutionary algorithms such as differential evolution, genetic algorithm, genetic programming, evolutionary programming, etc. In this research, we focus on the first one.
2.3 Differential evolution (DE)
Differential Evolution appeared when Ken Price tried to solve the Chebychev polynomial fitting problem [14] that had been posed to him by Rainer Storn. A breakthrough was made when Ken came up with the idea of using vector differences for perturbing the vector population. Since then, discussions between Ken and Rainer and computer simulations on both parts have brought in many considerable improvements which make DE the flexible and powerful tool it is today. The "DE community" has been developing since the early DE years of 1994 - 1996, and more researchers are working on and with DE. Ken and Rainer wish that DE will be improved further by scientists around the world and that DE may help more users in their daily work. This wish is the reason why DE has not been patented [8].
Figure 2.8 Steps of differential evolution algorithm
In this algorithm, a population of a certain number of individuals is first initialized, each of which is a float-valued vector bounded in a specific range. These vectors (target vectors) might be binarized, and are evaluated based on a fitness/objective function. The idea of this algorithm is that new generations of individuals are created from their parents using three operators:
- Mutation: based on the difference of randomly sampled pairs of individuals, defining the searching mechanism.
- Crossover: exchanges elements of a pair of individuals, increasing the diversity of features of the offspring.
- Selection: chooses the better individuals between parents and their children to become members of the next generation.
The process is repeated until generation t_max is reached or a predefined fitness value is satisfied. The result of the algorithm is the best individual, corresponding to a float-valued or binary vector P of n dimensions. In the case of text summarization, n is the number of sentences in the document collection and P is binary: P[i] = 1 means that sentence i is chosen to be included in the summary, otherwise P[i] = 0, i ∈ {1, 2, ..., n}. Figure 2.8 illustrates the main steps of a typical differential evolution algorithm.
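To make the binary representation concrete, the sketch below decodes a vector P into the sentences it selects. The function names are hypothetical, and the threshold of 0.5 used to binarize a float-valued vector is an assumption made for this example; the text above only states that the vectors might be binarized.

```python
def binarize(vector, threshold=0.5):
    """Map a float-valued vector to a binary one (the threshold is an assumption)."""
    return [1 if v >= threshold else 0 for v in vector]


def decode_summary(p, sentences):
    """Return the sentences whose entry in P equals 1, in document order."""
    if len(p) != len(sentences):
        raise ValueError("P must have one entry per sentence")
    return [s for bit, s in zip(p, sentences) if bit == 1]


sentences = ["S1", "S2", "S3", "S4"]   # n = 4 sentences in the collection
p = binarize([0.9, 0.2, 0.5, 0.1])     # float-valued individual -> [1, 0, 1, 0]
summary = decode_summary(p, sentences) # sentences S1 and S3 are selected
```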
Pseudo-code of this algorithm is given below [15]
Begin
  Generate randomly an initial population of solutions
  Calculate the fitness of the initial population
  Do
    For each parent
      Select three different solutions at random
      Create one offspring using DE operators (mutation,
Generate randomly an initial population of solutions:
Each individual or solution contains values for x1, x2 and x3. Thus, it is a three-dimensional vector of the form Xp = [xp1, xp2, xp3]. We initialize P individuals bounded in the interval [xmin, xmax]. In this case, we choose xmin = 0, xmax = 1 and P = 6.
Overall, we randomly generate six three-dimensional vectors X1, X2, X3, X4, X5 and X6 such that the elements of these vectors are bounded in (0, 1).
Calculate the fitness of the initial population (generation 0):
Table 2.2 Fitness of six individuals at generation 0
For each parent:
Now these six individuals are going to produce their own children
Firstly, choose individual 1 as the first target vector (the first parent)
Select three different solutions at random:
We randomly select three different individuals, for example individuals 2, 4 and 6.
Create one offspring using DE operators:
Mutation: Mutant vector $V_1 = [V_{1,1}, V_{1,2}, V_{1,3}]$
$V_{1,i} = X_{6,i} + F \cdot (X_{2,i} - X_{4,i})$
$i \in \{1, 2, 3\}$
$F$: mutant factor
Table 2.3 Creation of mutant vector V1
The result of the mutation operator is a mutant vector. We say that the mutant vector corresponding to target vector X1 = [0.68, 0.89, 0.04] is V1 = [1.58, 1.29, 0.35].
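The arithmetic of the mutation step can be reproduced with a short sketch. It assumes the mutant is formed as the base vector X6 plus F times the difference X2 - X4, which is how we read the mutation step of this example. The values of X2, X4, X6 and F below are hypothetical, chosen only to illustrate the calculation; they are not the values of Table 2.3.

```python
def mutate(base, a, b, f):
    """Return base + f * (a - b), computed element by element."""
    return [base[i] + f * (a[i] - b[i]) for i in range(len(base))]


# Hypothetical individuals and mutant factor (not the values of Table 2.3).
x2 = [0.9, 0.5, 0.3]
x4 = [0.4, 0.1, 0.2]
x6 = [0.2, 0.6, 0.7]
f = 0.8

v1 = mutate(x6, x2, x4, f)  # approximately [0.6, 0.92, 0.78]
```

Note that nothing in this step keeps the mutant inside the initial interval: the mutant vector V1 in the text has elements greater than 1 even though the population was initialized in (0, 1).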
Crossover: The mutant vector V1 does a crossover with the target vector X1 to create the trial vector Z1, as shown in Table 2.4.
Table 2.4 Creation of trial vector Z1
If offspring is the same or better than its parent (selection), parents are replaced:
If f(X1) < f(Z1), then the trial vector becomes the target vector X1 of the next generation (generation 1), as shown in Table 2.5.
Table 2.5 Values of X1 in generation 1
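Crossover and selection can be sketched as below, using the target vector X1 and mutant vector V1 given in the text. Several things here are assumptions made for the example: the binomial crossover scheme with a crossover rate `cr`, the rule that at least one element is taken from the mutant, and the use of the sum of the elements as fitness (the fitness function of the worked example is not reproduced in this section).

```python
import random


def crossover(target, mutant, cr, rng):
    """Binomial crossover: take each element from the mutant with probability cr."""
    n = len(target)
    forced = rng.randrange(n)  # at least one element comes from the mutant
    return [mutant[i] if (rng.random() < cr or i == forced) else target[i]
            for i in range(n)]


def select(target, trial, fitness):
    """Keep the trial vector if it is the same or better than its parent."""
    return trial if fitness(trial) >= fitness(target) else target


rng = random.Random(1)
x1 = [0.68, 0.89, 0.04]  # target vector from the text
v1 = [1.58, 1.29, 0.35]  # mutant vector from the text
z1 = crossover(x1, v1, cr=0.5, rng=rng)   # trial vector
next_x1 = select(x1, z1, fitness=sum)     # X1 of the next generation
```

With this assumed fitness, every element of V1 is larger than the corresponding element of X1, so any trial vector beats the target and replaces it.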
The process continues with target vector X2 of generation 0. The process ends when generation t_max (user defined) is reached. We take the maximum f(X) as the maximum fitness value and the corresponding X as the final solution. Figure 2.9 specifies the steps to get the value of X1 at generation 1.
[Figure: steps from the target vector X1 to the next X1, including the weighted difference F * (X2 - X4)]
Figure 2.9 Steps to get the next X1 (generation 1)
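Putting the steps together, a complete run over t_max generations might look like the following sketch. It is one possible reading of the procedure, under stated assumptions: binomial crossover with rate `cr`, clipping of mutant elements to [x_min, x_max], a synchronous update of the whole population, and the sum of the elements as a toy fitness. All parameter names and default values are illustrative.

```python
import random


def differential_evolution(fitness, dim=3, pop_size=6, f=0.8, cr=0.5,
                           t_max=100, x_min=0.0, x_max=1.0, seed=0):
    """Maximize `fitness`; requires pop_size >= 4."""
    rng = random.Random(seed)
    # Generate randomly an initial population of solutions.
    pop = [[rng.uniform(x_min, x_max) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(t_max):
        next_pop = []
        for p, target in enumerate(pop):
            # Select three different solutions at random (none is the parent).
            a, b, c = rng.sample([q for q in range(pop_size) if q != p], 3)
            # Mutation: base vector plus weighted difference, clipped to bounds
            # (the clipping is an assumption of this sketch).
            mutant = [min(x_max, max(x_min, pop[c][i] + f * (pop[a][i] - pop[b][i])))
                      for i in range(dim)]
            # Crossover: mix mutant and target into a trial vector.
            forced = rng.randrange(dim)
            trial = [mutant[i] if (rng.random() < cr or i == forced) else target[i]
                     for i in range(dim)]
            # Selection: the trial replaces its parent if it is the same or better.
            next_pop.append(trial if fitness(trial) >= fitness(target) else target)
        pop = next_pop
    return max(pop, key=fitness)


best = differential_evolution(sum)  # toy fitness: x1 + x2 + x3
```

Since each individual is only ever replaced by a trial vector that is at least as fit, the best fitness in the population cannot decrease between generations.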
On the whole, the properties that we need to take care of in DE are:
- Solution representation: a real-valued or binary vector.
- Number of individuals in the population when initialized: The population is usually initialized randomly, bounded in an interval. Population size is the number of individuals in the population in a generation. This is an important parameter that we need to decide. If the population size is too small, the algorithm converges too fast and individuals can reach only a small part of the search space. On the other hand, if the population size is too big, resources are wasted and the search process is prolonged.
- Objective/fitness function: This function evaluates how good the solution is, therefore it needs to be built carefully.
- Operators [22]:
o Mutation: The goal is to define the searching mechanism of the algorithm, generating mutant vectors. The mutant factor (F) decides how large a perturbation the solution can obtain. If the mutant factor is large, the size of the jump is increased. That is, the population can break away from a local optimum effectively