UNIVERSITY OF ENGINEERING AND TECHNOLOGY DO THUY DUONG RESEARCH AND APPLY EVOLUTIONARY COMPUTATION TECHNIQUES ON AUTOMATIC TEXT SUMMARIZATION MASTER THESIS IN INFORMATION TECHNOLOGY
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
DO THUY DUONG
RESEARCH AND APPLY EVOLUTIONARY
COMPUTATION TECHNIQUES ON
AUTOMATIC TEXT SUMMARIZATION
Field: Information technology Major: Software Engineering Code: 60480103
MASTER THESIS IN INFORMATION TECHNOLOGY
SUPERVISOR: Assoc Prof Nguyen Xuan Hoai
HANOI - 2015
Declaration of authorship
I, Do Thuy Duong, declare that this thesis, ‘Research and apply evolutionary computation techniques on automatic text summarization’, and the work presented in it are my own.
- Where I have consulted the published work of others, this is always clearly attributed;
- I have acknowledged all main sources of help;
- Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.
Acknowledgements
I am heartily thankful to my supervisor, Prof Nguyen Xuan Hoai, whose encouragement, guidance and support from the initial to the final level have enabled me to develop an understanding of the topic.
I would like to show my gratitude to the teachers in the University of Engineering and Technology, Vietnam National University, Hanoi for helping me to gain a large body of knowledge during my two years of studying.
Lastly, I offer my regards and blessings to my friends and my family, who have always encouraged me so that I could finish this challenging research.
Contents
Declaration of authorship
Acknowledgements
Contents
2.3 Differential evolution (DE)
Automatic text summarization using differential evolution algorithm
3.1 Automatic text summarization using differential evolution (DE)
3.1.1 Document collection representation
3.1.2 Objective/Fitness function
3.1.3 Main steps of differential evolution
3.1.4 Experiment, result and discussion
3.2.1 Method
3.2.2 Experiment, result and discussion
Conclusion and future work
4.2 Future work
5 Reference
List of figures
Figure 2.1 A typical summarization system
Figure 2.2 A summarizer highlights all sentences included in an extractive summary
Figure 2.3 An example of the abstract summary
Figure 2.4 Multi-document summarization
Figure 2.5 The general scheme of an Evolutionary Algorithm in pseudo-code
Figure 2.6 General scheme of evolutionary algorithms
Figure 2.7 Correlation between number of generations and best fitness in population
Figure 2.8 Steps of differential evolution algorithm
Figure 2.9 Steps to get the next X1 (generation 1)
Figure 3.1 Illustration of mutation operation
Figure 3.2 Illustration of crossover operation
Figure 3.3 Changes in summary length in [DE] method on DUC2004
Figure 3.4 Changes in summary length in [DE] method on DUC2007
Figure 3.5 Summary length in [MultiDE] method on DUC2004
Figure 3.6 Summary length in [MultiDE] method on DUC2007
Figure 3.7 Comparison between F-values of [DE] and [MultiDE] on DUC2004
List of tables
Table 2.1 The basic evolutionary computation linking natural evolution to problem solving
Table 2.2 Fitness of six individuals at generation 0
Table 2.3 Creation of mutant vector V1
Table 2.4 Creation of trial vector Z1
Table 2.5 Values of X1 in generation 1
Table 3.1 Description of the datasets used in the experiment
Table 3.2 Parameter settings of the first experiment
Table 3.3 Summary lengths of some document collections in DUC2004 using [DE] method
Table 3.4 Summary lengths of some document collections in DUC2007 using [DE] method
Table 3.5 F-values of three evaluation measures of method [DE] on DUC2004 and DUC2007
Table 3.6 Parameter settings of the second experiment
Table 3.7 Summary lengths of some document collections in DUC2004 using [MultiDE] method
Table 3.8 Summary lengths of some document collections in DUC2007 using [MultiDE] method
Table 3.9 F-values of three evaluation measures of method [MultiDE] on
In this thesis, we study some evolutionary computation techniques and then apply the differential evolution algorithm to a practical problem: automatic text summarization, in particular multi-document summarization. Moreover, we attempt to deal with the constraint on the summary length, which has not been handled effectively in these stochastic population-based methods.
1.1 Motivation
Evolutionary computation techniques use different algorithms to evolve a population of individuals over a certain number of generations. Operations such as mutation, crossover and selection are applied to the population to produce new offspring, which then compete with each other and with the previous generation to survive, based on some evaluation function. The process ends when a stopping criterion is reached, and the best individual found is taken as the best solution to the real-world problem.
Evolutionary algorithms have been applied to numerous problems in various fields, one of which is automatic text summarization. However, we have found that, unlike sentence ranking methods, they are weak at controlling the summary length. Therefore, this research attempts to improve this aspect of these algorithms.
1.2 Research Objectives
This thesis aims to study evolutionary computation techniques, especially the differential evolution algorithm, and their application to the problem of automatic text summarization. We identify the limitations of the ways other researchers handle the summary length in this algorithm, then propose a new method that manages the length constraint according to users’ demands while still preserving the quality of the summary.
1.3 Thesis overview
The rest of this thesis is organized as follows. In chapter 2, we review the background knowledge of text summarization and its classification, and introduce the main principles of evolutionary computation. In particular, the differential evolution algorithm is discussed.
Chapter 3 explains in detail how this algorithm is applied to automatic text summarization, in our case on multi-document collections. An experiment is then performed to test the original differential evolution algorithm. We then improve on the result of that experiment by handling the summary length, so that the document collection is compressed quickly and effectively.
Chapter 4 recapitulates the thesis, presents our contributions and states some future research directions in this field.
2 Chapter 2
Background knowledge
In this chapter, text summarization is reviewed before we introduce and classify evolutionary computation. Then an evolutionary algorithm, namely differential evolution, is discussed in detail.
2.1 Automatic text summarization
2.1.1 Definition
Automatic text summarization is the generation, by a computer program, of a shorter version of a text that still keeps the most important points of the original text [6].
The aim of automated text summarization is to take a source text, extract the most significant content from it, and present it in a condensed form and in a way sensitive to the user’s or application’s needs.
A summarization system goes through several steps to generate a summary from a document or a collection of documents. First, the document is preprocessed, for example by handling punctuation and lower/upper case and by splitting the text into paragraphs, sentences and words. Then the document is represented in a certain data type such as vectors, each of which represents a sentence. The third step, the key phase, is to create the summary representation from the document representation; for instance, some of the above vectors are chosen to be included in the summary. Finally, the summary is formed from the summary representation in the summary generation stage. Figure 2.1 shows a typical summarization system.
Figure 2.1 A typical summarization system
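The four stages described above (preprocessing, document representation, summary representation, summary generation) can be sketched in Python as follows. This is only a minimal illustration: all function names are hypothetical, and the selection step uses a simple term-overlap score as a stand-in for the key phase, not the method developed in this thesis.

```python
import re
from collections import Counter

def preprocess(document):
    """Stage 1: split into sentences and lower-cased word tokens."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    tokens = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    return sentences, tokens

def represent(tokens):
    """Stage 2: one term-frequency vector (a Counter) per sentence."""
    return [Counter(words) for words in tokens]

def select(vectors, k):
    """Stage 3 (placeholder key phase): keep the k sentences whose terms
    are most frequent in the whole document. This scoring is an assumption."""
    document_terms = Counter()
    for vector in vectors:
        document_terms.update(vector)
    scores = [sum(document_terms[w] for w in vector) for vector in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # restore original sentence order

def generate(sentences, chosen):
    """Stage 4: build the summary text from the chosen sentence indices."""
    return " ".join(sentences[i] for i in chosen)

def summarize(document, k=2):
    sentences, tokens = preprocess(document)
    return generate(sentences, select(represent(tokens), k))

text = "Cats sleep a lot. Dogs bark. Cats and dogs are pets. Birds sing."
print(summarize(text, k=2))
```

Replacing `select` with a different procedure, for example an evolutionary search, leaves the other three stages unchanged.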
2.1.2 Types of text summarization
There are several ways to classify approaches to automatic text summarization [16]:
- Content:
• Extract: An extract-type summary contains only units taken verbatim from the original text, ranging from single words to whole paragraphs. Figure 2.2 presents a summarization system which selects important sentences to be included in the extractive summary.
Syria, which controls Lebanon, is allowing the country to serve as a haven for leading terrorist organizations, including remnants of al-Qaeda seeking refuge from Afghanistan.
At least 25 percent of the groups designated as foreign terrorist organizations by the State Department have a presence in Lebanon and are receiving some form of Syrian support. Despite repeated calls from the United States to end its support for terror, Damascus, with the help of Iran, is continuing to grant these groups safe haven, logistical assistance, training facilities and political backing.
The report also notes that the Lebanese government has so far "refused to freeze the assets of Hizballah or close down the offices of rejectionist Palestinian organizations."
Iran's close cooperation with Lebanon-based terrorist groups was evident during a recent visit by Iranian President Mohammed Khatami to Beirut.
Weapons shipments are delivered regularly from Tehran and Damascus to Hizballah terrorists in Lebanon, resulting in a stockpile of at least 10,000 Katyusha rockets with the capability of hitting major Israeli population centers. When the Palestinian Authority (PA) attempted to import more than 50 tons of Iranian arms aboard the Karine-A ship, those purchases were conducted by PA financial adviser Fuad Shubaki during a meeting in Lebanon
Article talks about: terror, lebanon group al-qaeda, hizballah,
Figure 2.2 A summarizer highlights all sentences included in an extractive summary
• Abstract: An abstract-type summary is a newly generated text covering the content of the source text as well as a review of it, which requires the summarizer to have prior knowledge about the topic of the source text. Figure 2.3 shows an abstract that summarizes the content of a whole paper.
EVALUATION MEASURES FOR TEXT SUMMARIZATION
Josef STEINBERGER, Karel JEZEK
Department of Computer Science and Engineering
University of West Bohemia in Pilsen
Univerzitní 8
306 14 Plzeň, Czech Republic
e-mail: {jstein, jezekka}@kiv.zcu.cz
Revised manuscript received 20 March 2007
Abstract. We explain the ideas of automatic text summarization approaches and the taxonomy of summary evaluation methods. Moreover, we propose a new evaluation measure for assessing the quality of a summary. The core of the measure is covered by Latent Semantic Analysis (LSA) which can capture the main topics of a document. The summarization systems are ranked according to the similarity of the main topics of their summaries and their reference documents. Results show a high correlation between human rankings and the LSA-based evaluation measure. The measure is designed to compare a summary with its full text. It can com-
Figure 2.3 An example of the abstract summary
- Usage:
• Indicative: An indicative summary only indicates the main subject matter or domain of the input text without including its contents. After reading an indicative summary, one can explain what the input text was about, but not necessarily what was contained in it.
• Informative: An informative summary covers (some of) the content, and allows one to describe (parts of) what was in the input text.
- Expansiveness:
• Background: Assumes that readers do not have prior knowledge about the topic of the source text.
• Just-the-news: Assumes that the reader’s prior knowledge is up to date.
- Monolingual vs. cross-lingual: Summarizes in the same language vs. summarizes and also translates into another language.
- Single-document vs. multi-document source: Summarizes only one source text vs. fuses together many source texts. Figure 2.4 demonstrates a multi-document summarizer, which condenses five documents into a single summary.
2.1.3 Methodologies for automatic text summarization
Up to now, many methods have been applied to summarize text automatically, including [21]:
- Traditional methods: term, word and phrase frequencies
- Corpus-based approaches: combination of statistical features, learning to extract
- Discourse structures: WordNet, rhetorical analysis
- Knowledge-rich approaches: different for particular domains
Evolutionary computation is a newer approach to automatic text summarization, in which solutions are evolved until a certain criterion is satisfied.
2.2 Evolutionary computation
Evolutionary computation works through the continuous progression of a population, whose individuals are selected in a guided random search until the required stopping condition is reached.
Automated problem solving based on Darwinian principles started in the 1950s. However, three different interpretations of this idea began to be implemented in the 1960s, in three strands.
Evolutionary programming (EP) was invented by Lawrence J. Fogel in the US, while John Henry Holland proposed a method named the genetic algorithm (GA). Ingo Rechenberg and Hans-Paul Schwefel introduced evolution strategies (ES). Although these algorithms were proposed quite early, they have only been regarded as different types of one technology, known as evolutionary computation, since the early 1990s [1].
This concept is based on natural evolution. In nature, the plants and animals that have survived and adapted to the changing environment are the fittest ones, those not eliminated by natural selection. Individuals in a population act as parents, producing new offspring through mutation and crossover. These children have to compete against others, including their parents, to survive into the next generation. Overall, mutation and crossover diversify the properties of the offspring, while natural selection increases the quality (fitness) of the population [2]. Table 2.1 shows the equivalent concepts in natural evolution and problem solving [3].
INITIALISE population with random candidate solutions;
EVALUATE each candidate;
REPEAT UNTIL ( TERMINATION CONDITION is satisfied ) DO
    1 SELECT parents;
    2 RECOMBINE pairs of parents;
    3 MUTATE the resulting offspring;
    4 EVALUATE new candidates;
    5 SELECT individuals for the next generation;
Figure 2.5 The general scheme of an Evolutionary Algorithm in pseudo-code
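The general scheme above can be turned into a short runnable sketch. The version below is one possible reading under stated assumptions, not a reference implementation: parents are chosen by binary tournament, survivors are the best individuals among parents and offspring, and the termination condition is a fixed number of generations. The function names and the toy problem (maximising the number of ones in a bit string) are hypothetical and serve only as illustration.

```python
import random

def evolve(fitness, init, recombine, mutate, pop_size=20, generations=50, seed=0):
    rng = random.Random(seed)
    # INITIALISE and EVALUATE: store (fitness, individual) pairs
    scored = [(fitness(ind), ind) for ind in (init(rng) for _ in range(pop_size))]
    for _ in range(generations):  # assumed termination: fixed generation count
        offspring = []
        for _ in range(pop_size):
            # 1 SELECT parents (assumption: binary tournament)
            p1 = max(rng.sample(scored, 2), key=lambda s: s[0])[1]
            p2 = max(rng.sample(scored, 2), key=lambda s: s[0])[1]
            # 2 RECOMBINE, 3 MUTATE, 4 EVALUATE
            child = mutate(recombine(p1, p2, rng), rng)
            offspring.append((fitness(child), child))
        # 5 SELECT survivors: best pop_size among parents and offspring
        scored = sorted(scored + offspring, key=lambda s: s[0], reverse=True)[:pop_size]
    return max(scored, key=lambda s: s[0])

# Toy problem: maximise the number of ones in a 10-bit string.
def random_bits(rng, n=10):
    return [rng.randint(0, 1) for _ in range(n)]

def one_point(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def flip(bits, rng, rate=0.1):
    return [1 - b if rng.random() < rate else b for b in bits]

best_fitness, best = evolve(sum, random_bits, one_point, flip)
print(best_fitness, best)
```

Because survivors are taken from parents and offspring together, the best fitness in the population never decreases from one generation to the next.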
Figure 2.6 General scheme of evolutionary algorithms
Evolutionary computation comprises several algorithms that are used to search for optimal solutions to a problem.
Figure 2.6 illustrates how a typical population is transformed from start to finish. An evolutionary algorithm starts by initializing a number of individuals that form a population. Each individual is evaluated with a fitness function, which varies with the type of algorithm and the specific problem. Some or all of these individuals are chosen to be parents. They undergo reproduction operators to produce new children. The fitness values of those offspring are then calculated; in other words, the quality of the offspring is assessed. The better individuals among parents and children are chosen as members of the next generation. The process is repeated until a stopping criterion is met, and the best individual found is returned.
According to A. E. Eiben and J. E. Smith, the typical progression of fitness in a run is as shown in Figure 2.7 [3]:
Figure 2.7 Correlation between number of generations and best fitness in population (horizontal axis: time, in number of generations)
The mechanisms that decide how children are created and how individuals are chosen among parents and children vary among specific evolutionary algorithms. The following section explains in detail one technique that is applied to a real-world problem, automatic text summarization. In this thesis, we deal with extractive multi-document summarization.
Typical evolutionary algorithms include differential evolution, genetic algorithms, genetic programming and evolutionary programming. In this research, we focus on the first of these.
2.3 Differential evolution (DE)
Differential evolution appeared when Ken Price tried to solve the Chebyshev polynomial fitting problem [14] that had been posed to him by Rainer Storn. A breakthrough came when Ken had the idea of using vector differences to perturb the vector population. Since then, discussions between Ken and Rainer and computer simulations by both have brought many substantial improvements, which make DE the flexible and powerful tool it is today. The "DE community" has been growing since the early DE years of 1994 - 1996, and more and more researchers are working on and with DE. Ken and Rainer hope that DE will be improved further by scientists around the world and that it will help more users in their daily work. This wish is the reason why DE has not been patented [8].
In this algorithm, a population of a certain number of individuals is initialized, each of which is a float-valued vector bounded in a specific range. These vectors (target vectors) may be binarized, and they are evaluated with a fitness/objective function. The idea of the algorithm is that new generations of individuals are created from their parents using three operators:
- Mutation: based on the difference of randomly sampled pairs of individuals; it defines the searching mechanism.
- Crossover: exchanges elements between a pair of individuals, increasing the diversity of the offspring's features.
- Selection: chooses the better individuals between parents and their children to become members of the next generation.
The process is repeated until a predefined number of generations is reached or a predefined fitness value is attained. The result of the algorithm is the best individual, which corresponds to a float-valued or binary vector P of n dimensions. In the case of text summarization, n is the number of sentences in the document collection and P is binary: P[i] = 1 means that sentence i is chosen to be included in the summary; otherwise P[i] = 0, for i = 1, 2, ..., n. Figure 2.8 illustrates the main steps of a typical differential evolution algorithm.
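To make the role of the binary vector P concrete, the following sketch shows how a float-valued individual could be binarized and then used to pick sentences. The thresholding rule (values of 0.5 or more become 1) is an assumption made only for illustration, since the binarization rule is not specified at this point; the function names and the sample sentences are hypothetical.

```python
def binarize(vector, threshold=0.5):
    """Map a float-valued individual to a binary vector P (assumed rule)."""
    return [1 if value >= threshold else 0 for value in vector]

def summary_from_mask(sentences, mask):
    """Keep sentence i exactly when P[i] = 1."""
    return [sentence for sentence, bit in zip(sentences, mask) if bit == 1]

sentences = ["Sentence A.", "Sentence B.", "Sentence C.", "Sentence D."]
individual = [0.9, 0.2, 0.5, 0.1]   # hypothetical float-valued vector, n = 4
P = binarize(individual)
print(P, summary_from_mask(sentences, P))
```

With this representation, the summary length is determined entirely by which positions of P are set to 1.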
Pseudo-code of this algorithm is given below [15]:
Begin
    Generate randomly an initial population of solutions
    Calculate the fitness of the initial population
    Do
        For each parent
            Select three different solutions at random
            Create one offspring using DE operators (mutation, crossover)
            If offspring is the same or better than its parent (selection)
                Parent is replaced
        End For
    While the stopping condition is not satisfied
End
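The pseudo-code above could be realised in Python roughly as follows. This is a sketch under several assumptions that the pseudo-code leaves open: the three random solutions are also different from the parent, crossover is binomial with one position always taken from the mutant, mutant values are clipped back into the bounds, parents are replaced immediately rather than at the end of the generation, and the stopping condition is a fixed number of generations. The parameter names `F` and `CR` and their default values are illustrative choices.

```python
import random

def differential_evolution(fitness, dim, pop_size=6, F=0.8, CR=0.5,
                           lower=0.0, upper=1.0, generations=100, seed=0):
    rng = random.Random(seed)
    # Generate randomly an initial population and calculate its fitness
    pop = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(pop_size)]
    fit = [fitness(x) for x in pop]
    for _ in range(generations):  # assumed stopping condition
        for p in range(pop_size):  # for each parent
            # Select three different solutions at random (also different from p)
            r1, r2, r3 = rng.sample([i for i in range(pop_size) if i != p], 3)
            forced = rng.randrange(dim)  # this position always comes from the mutant
            trial = []
            for i in range(dim):
                if rng.random() < CR or i == forced:
                    # Mutation: base vector plus scaled difference vector
                    value = pop[r1][i] + F * (pop[r2][i] - pop[r3][i])
                    trial.append(min(max(value, lower), upper))  # clipping: assumption
                else:
                    trial.append(pop[p][i])  # crossover keeps the parent's value
            f_trial = fitness(trial)
            # Selection: offspring replaces the parent if it is the same or better
            if f_trial >= fit[p]:
                pop[p], fit[p] = trial, f_trial
    best = max(range(pop_size), key=lambda i: fit[i])
    return pop[best], fit[best]

solution, value = differential_evolution(sum, dim=3)
print(solution, value)
```

Note that the population size must be at least four so that three solutions distinct from the parent can be drawn.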
Example [7]:
The following numerical example demonstrates the DE algorithm. We have the objective/fitness function:
Maximize $f(X) = x_1 + x_2 + x_3$, in which $X = [x_1, x_2, x_3]$
Our goal is to find $x_1$, $x_2$ and $x_3$. We follow the steps in the pseudo-code above to solve this problem.
Generate randomly an initial population of solutions:
Each individual, or solution, contains values for $x_1$, $x_2$ and $x_3$. Thus, it is a three-dimensional vector of the form $X_p = [x_{p,1}, x_{p,2}, x_{p,3}]$. We initialize $P$ individuals bounded in the interval $[x_{\min}, x_{\max}]$. In this case, we choose $x_{\min} = 0$, $x_{\max} = 1$ and $P = 6$.
Overall, we randomly generate six three-dimensional vectors X1, X2, X3, X4, X5 and X6 whose elements are bounded in [0, 1].
Calculate the fitness of the initial population (generation 0):
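The two steps so far, random initialisation and fitness evaluation, can be sketched in Python as follows. The seed and the function names are arbitrary, hypothetical choices, so the vectors printed here are not the ones used in the worked example; only the procedure is the same.

```python
import random

def fitness(x):
    """Objective of the example: f(X) = x1 + x2 + x3 (to be maximised)."""
    return sum(x)

def init_population(size=6, dim=3, x_min=0.0, x_max=1.0, seed=42):
    """Draw `size` random vectors with every element in [x_min, x_max]."""
    rng = random.Random(seed)
    return [[rng.uniform(x_min, x_max) for _ in range(dim)] for _ in range(size)]

population = init_population()
for p, x in enumerate(population, start=1):
    print(f"X{p} = {[round(v, 2) for v in x]}  f(X{p}) = {fitness(x):.2f}")
```

For the target vector X1 = [0.68, 0.89, 0.04] that appears in the example, the same `fitness` function gives 0.68 + 0.89 + 0.04 = 1.61.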
Now these six individuals are going to produce their own children.
First, choose individual 1 as the first target vector (the first parent):
Select three different solutions at random:
We randomly select three different individuals, for example individuals 2, 4 and 6.
Create one offspring using DE operators:
Mutation: Mutant vector $V_1 = [v_{1,1}, v_{1,2}, v_{1,3}]$
$v_{1,i} = x_{6,i} + F \cdot (x_{2,i} - x_{4,i})$
$i \in \{1, 2, 3\}$
F: mutant factor
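The mutation formula can be checked with a few lines of Python. The vectors used below for individuals 2, 4 and 6 and the value F = 0.8 are assumed, hypothetical inputs, chosen only so that the arithmetic reproduces the mutant vector V1 = [1.58, 1.29, 0.35] reported in the text; they should not be read as the actual contents of Table 2.2.

```python
def mutate(base, a, b, F):
    """Mutant vector: base + F * (a - b), computed element by element."""
    return [x + F * (y - z) for x, y, z in zip(base, a, b)]

# Assumed (hypothetical) inputs, not taken from Table 2.2
x2 = [0.92, 0.92, 0.33]
x4 = [0.12, 0.09, 0.05]
x6 = [0.94, 0.63, 0.13]
F = 0.8

v1 = mutate(x6, x2, x4, F)
print([round(v, 2) for v in v1])
```

Note that the mutant vector may leave the interval [0, 1] used at initialisation, as the first two components do here.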
Table 2.3 Creation of mutant vector V1
The result of the mutation operator is a mutant vector. The mutant vector corresponding to the target vector X1 = [0.68, 0.89, 0.04] is V1 = [1.58, 1.29, 0.35].
Crossover: The mutant vector V1 undergoes crossover with the target vector X1 to create the trial vector Z1, as shown in Table 2.4.
If k = 1, then
If offspring is the same or better than its parent (selection), the parent is replaced:
Since f(X1) < f(Z1), the trial vector becomes the target vector X1 of the next generation (generation 1), as shown in Table 2.5.
as the maximum fitness value and X as the final solution. Figure 2.9 specifies the steps to get the value of X1 at generation 1:
Figure 2.9 Steps to get the next X1 (generation 1)
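The crossover and selection steps for X1 can be sketched as follows, starting from the target vector X1 and the mutant vector V1 given in the example. Binomial crossover with a crossover rate CR = 0.5, one position always taken from the mutant, and the random seed are assumptions made for illustration; the resulting trial vector therefore need not coincide with the Z1 of Table 2.4.

```python
import random

def fitness(x):
    return sum(x)

def binomial_crossover(target, mutant, CR, rng):
    """Take each element from the mutant with probability CR;
    one randomly chosen position always comes from the mutant."""
    forced = rng.randrange(len(target))
    return [m if (rng.random() < CR or i == forced) else t
            for i, (t, m) in enumerate(zip(target, mutant))]

def select(target, trial):
    """Keep the trial vector if it is the same or better than the target."""
    return trial if fitness(trial) >= fitness(target) else target

x1 = [0.68, 0.89, 0.04]   # target vector from the example
v1 = [1.58, 1.29, 0.35]   # mutant vector from the example
rng = random.Random(1)    # arbitrary seed

z1 = binomial_crossover(x1, v1, CR=0.5, rng=rng)
next_x1 = select(x1, z1)
print(z1, next_x1)
```

Since every component of V1 is larger than the corresponding component of X1 and at least one component of the trial vector comes from V1, the trial vector always has the higher fitness here and replaces X1.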
On the whole, the properties that we need to take care of in DE are:
- Solution representation: a real-valued or binary vector.
- Number of individuals in the population at initialization: The population is usually initialized randomly, bounded in an interval. The population size is the number of individuals in the population in a generation. This is an important parameter to decide. If the population size is too small, the algorithm converges too fast and the individuals can reach only a small part of the search space. On the other hand, if the population size is too big, resources are wasted and the search process is prolonged.
- Objective/fitness function: This function evaluates how good a solution is, so it needs to be built carefully.
- Operators [22]:
o Mutation: The goal is to define the search mechanism of the algorithm by generating mutant vectors. The mutant factor (F) determines how strongly a solution is perturbed. If the mutant factor is large, the jump size increases; that is, the population can escape a local optimum more effectively.