Information Extraction for Financial Analysis
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

ENTRY FOR THE "STUDENT SCIENTIFIC RESEARCH" AWARD, 2012

Title: Information Extraction for Financial Analysis
Student: Lê Văn Khánh (Male)
Supervisor: Dr. Phạm Bảo Sơn

HANOI - 2012
Abstract

Today, a great deal of useful information on the World Wide Web is formatted for human readers, which makes it difficult to extract relevant data from the various sources. Information Extraction (IE) was born to solve this problem. Flexible IE systems that transform information resources into program-friendly structures, such as relational databases or XML, are becoming a great necessity. In this report, we present the problem of applying Information Extraction to Financial Analysis. The main goal is to extract information from thousands of financial reports written in different formats. We also present a systematic approach to building recognition rules, and we evaluate the performance of our system.
Contents

Chapter 1: Introduction
  1.1 Subject Overview
  1.2 Information Extraction
  1.3 Report Structure
Chapter 2: Approaches in Information Extraction
  2.1 Manually Constructed IE Systems
    2.1.1 TSIMMIS tool
    2.1.2 W4F
    2.1.3 XWRAP
  2.2 Supervised IE Systems
    2.2.1 SRV
    2.2.2 RAPIER
    2.2.3 WHISK
  2.3 Semi-Supervised IE Systems
    2.3.1 IEPAD
    2.3.2 OLERA
  2.4 Unsupervised IE Systems
    2.4.1 RoadRunner
    2.4.2 DEPTA
Chapter 3: Our Approach
  3.1 Problem Formalization
    3.1.1 HTML Mode
    3.1.2 Plain-text Mode
  3.2 Approaches & Implementation
    3.2.1 HTML Mode
      3.2.1.1 Preprocessing
      3.2.1.2 Extracting
      3.2.1.3 Finalizing
Chapter 4: Experimental Setup and Evaluations
  4.1 Evaluation Metrics
  4.2 Corpus Development
  4.3 Evaluation Criteria
  4.4 Training Process
  4.5 Testing Process
Chapter 5: Conclusion & Future Work
References
Chapter 1
Introduction

1.1 Subject Overview
These reports contain a great deal of information; however, people need only brief but essential information in order to quickly understand what is reported in them. For example, given a document like Figure 1.1, the system produces output as in Figure 1.2. Figure 1.3 shows the general scenario of our work.

Figure 1.1 A report
Figure 1.2 Excel format
Figure 1.3 Scenarios

In this scenario (Figure 1.3), we concentrate only on step 1: processing reports to produce output like Figure 1.2; at step 2, financial experts then analyze these outputs. To sum up, our task is to apply Information Extraction for Financial Analysis to produce such output (e.g., Figure 1.2 above).
1.2 Information Extraction
Information extraction (IE) was originally applied to identify desired information in natural language text and convert it into a well-defined structure, e.g., a database with particular fields. With the huge and rapidly increasing amount of information sources and electronic documents available on the World Wide Web, information extraction has been extended to the identification of data in structured and semi-structured web pages. Recently, more and more research groups have concentrated on the development of information extraction systems for areas such as web mining and question answering. Research on information extraction can be divided into two subareas: the extraction patterns used for identification of target information in a given text, and machine learning techniques that automatically build such extraction patterns in order to avoid expensive construction by hand. Many information extraction systems have been successfully implemented, and some of them perform very well. To be specific, Figure 1.4 shows an example of information extraction: given a document of a seminar announcement, the entities Date, Start-time, Location, Speaker and Topic can be identified.
Figure 1.4 Information Extraction for a Seminar Announcement

Formally, an IE task is defined by its input and its extraction target. The input can be unstructured documents, such as plain text written in natural language (e.g., Figure 1.4), or semi-structured documents that are popular on the Web, such as tables or itemized and enumerated lists (e.g., Figure 1.5).

Figure 1.5 A semi-structured page containing data records (in rectangular boxes) to be extracted
The extraction target of an IE task can be a relation of k-tuples (where k is the number of fields/attributes in a record), or a complex object with hierarchically organized data. For some IE tasks, an attribute may have zero (missing) or multiple instantiations in a record. IE systems are also called extractors or wrappers.
Traditional IE systems mainly use rule-based, machine learning and pattern mining techniques to exploit the information.
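For illustration, a k-tuple record with a missing attribute and a multiply instantiated attribute might be represented as in the Python sketch below; the task and field names are hypothetical, not from our system:

```python
# An illustrative 4-tuple record (k = 4) for a book-listing IE task.
# Attributes may be missing (None) or have multiple instantiations (a list).
record = {
    "title":   "Data on the Web",
    "authors": ["Abiteboul", "Buneman", "Suciu"],  # multiple instantiations
    "price":   "39.95",
    "isbn":    None,                               # missing attribute
}
```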
1.3 Report Structure
Our report is organized as follows. In Chapter 2, we introduce IE systems in the information extraction domain and review some of the solutions that have been proposed. In Chapter 3, we describe our approach and system implementation. Chapter 4 describes the experiments we carried out to evaluate the quality of our approach. Finally, Chapter 5 presents our conclusion and future work.
Chapter 2
Approaches in Information Extraction
Earlier IE systems were designed to facilitate programmers in writing extraction rules, while later IE systems apply machine learning to generate generalized rules automatically. Such systems offer different degrees of automation and accuracy. Therefore, IE systems can be classified into four classes: manually constructed IE systems, supervised IE systems, semi-supervised IE systems and unsupervised IE systems.
2.1 Manually Constructed IE Systems
In manually constructed IE systems, users create a wrapper for each input source by hand, using general programming languages such as Java, Python or Perl, or using specially designed languages. These tools require developers with substantial computer and programming backgrounds, which makes them expensive. Such systems include TSIMMIS [1], W4F [2] and XWRAP [3].
2.1.1 TSIMMIS tool
The main component of this tool is a wrapper that takes as input a specification file that declaratively states where the data of interest is located and how it should be packaged into objects. For example, Figure 2.1(a) shows such a specification file.
Figure 2.1 (a) A TSIMMIS specification file and (b) the OEM output
Each command is of the form [variables, source, pattern], where source specifies the input text to be considered, pattern specifies how to find the text of interest within the source, and variables are a list of variables that hold the extracted results. The special symbol "*" in a pattern means discard, and "#" means save into the variables. TSIMMIS then outputs data in the Object Exchange Model (e.g., Figure 2.1(b)), which contains the extracted data together with information about the structure and contents of the result.
TSIMMIS provides two important operators: split and case. The split operator divides an input list element into individual elements, and the case operator allows the user to handle irregularities in the structure of the input pages.
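To make the command semantics concrete, the following is a minimal Python sketch of how such a [variables, source, pattern] command could be interpreted; it is an illustration only, not TSIMMIS's actual implementation, and the sample page is made up:

```python
import re

def apply_command(source_text, pattern):
    """Interpret a TSIMMIS-style pattern: '*' means discard the
    matched text, '#' means save it into the output variables."""
    regex_parts = []
    for piece in re.split(r'([*#])', pattern):
        if piece == '*':
            regex_parts.append(r'.*?')            # match and discard
        elif piece == '#':
            regex_parts.append(r'(.*?)')          # match and capture
        else:
            regex_parts.append(re.escape(piece))  # literal text
    match = re.search(''.join(regex_parts), source_text, re.DOTALL)
    return list(match.groups()) if match else []

# Hypothetical command [root, page, "*<TITLE>#</TITLE>*"]:
page = "<HTML><TITLE>Quarterly Report</TITLE><BODY>...</BODY></HTML>"
print(apply_command(page, "*<TITLE>#</TITLE>*"))  # ['Quarterly Report']
```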
2.1.2 W4F
W4F stands for World Wide Web Wrapper Factory, a Java toolkit for generating Web wrappers. The wrapper development process consists of three independent layers: retrieval, extraction and mapping. In the retrieval layer, a document is retrieved from the Web through the HTTP protocol, cleaned up and then parsed into a tree following the Document Object Model (DOM) [5]. In the extraction layer, extraction rules are applied to the DOM tree to extract information, which is then stored in an internal format called the Nested String List (NSL). In the mapping layer, the NSL structures are exported to the upper-level application according to mapping rules. Extraction rules are expressed in HEL (HTML Extraction Language), which uses HTML parse tree (i.e., DOM tree) paths to address the data to be located. For example, users can use regular expressions to match or split (following the programming language syntax) the string obtained by a DOM tree path.
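The three-layer flow can be mirrored in a short Python sketch; note that this uses the standard library rather than W4F's HEL, and the extraction target (the page title) is only an example:

```python
# A minimal Python analogue of W4F's three layers (HEL itself is not
# shown; this only mirrors the retrieval -> extraction -> mapping flow).
import urllib.request
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Extraction layer: walk the parse events and collect <title> text."""
    def __init__(self):
        super().__init__()
        self.in_title, self.titles = False, []
    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

def wrap(url):
    html = urllib.request.urlopen(url).read().decode("utf-8")  # retrieval layer
    parser = TitleExtractor()
    parser.feed(html)                                          # extraction layer
    return {"title": parser.titles}                            # mapping layer
```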
2.1.3 XWRAP
algorithm is used here.
2.2 Supervised IE Systems
Supervised IE systems take a set of input pages labeled with examples of the data to be extracted and output a wrapper. The user provides an initial set of annotated examples to train the system. For such systems, general users rather than programmers can be trained to use the labeling GUI, thus reducing the cost of wrapper generation. Such systems include SRV [4], RAPIER [6] and WHISK [12].
2.2.1 SRV
SRV is a top-down relational algorithm that generates single-slot extraction rules. The input documents are tokenized, and all substrings of continuous tokens are labeled as either extraction targets or not. The rules generated by SRV are logic rules that rely on a set of token-oriented features, which can be either simple or relational. A simple feature is a function that maps a token to some discrete value such as length, character type (e.g., numeric), orthography (e.g., capitalized) or part of speech (e.g., verb). A relational feature maps a token to another token, e.g., the contextual (previous or next) tokens of the input token.
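The following sketch illustrates the distinction between simple and relational token features in Python; the feature names and the sample sentence are our own, not SRV's syntax:

```python
# Simple features map a token to a discrete value; relational features
# map a token to another token, such as its neighbours.
def length(token):          return len(token)
def is_numeric(token):      return token.isdigit()
def is_capitalized(token):  return token[:1].isupper()

def prev_token(tokens, i):  return tokens[i - 1] if i > 0 else None
def next_token(tokens, i):  return tokens[i + 1] if i < len(tokens) - 1 else None

tokens = "Seminar starts at 3 pm in Room 125".split()
i = tokens.index("3")
print(is_numeric(tokens[i]), prev_token(tokens, i), next_token(tokens, i))
# True at pm
```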
2.2.2 RAPIER
RAPIER also focuses on field-level extraction but uses a bottom-up (compression-based) relational learning algorithm: it begins with the most specific rules and then replaces them with more general ones. RAPIER learns single-slot extraction patterns that make use of syntactic and semantic information, including a part-of-speech tagger and a lexicon (WordNet). It also uses templates to learn extraction patterns. The extraction rules contain three distinct patterns: the pre-filler pattern matches the text immediately preceding the filler, the second pattern matches the actual slot filler, and the post-filler pattern matches the text immediately following the filler. As an example, Figure 2.2 shows the extraction rule for a book title, which is immediately preceded by the words "Book", "Name" and "</b>", and immediately followed by the word "<b>".
Figure 2.2 RAPIER extraction rule
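In the spirit of Figure 2.2, such a rule can be approximated as three concatenated regular expressions; this is a simplified Python illustration with invented sample text, not RAPIER's actual rule language:

```python
import re

# Pre-filler matches text just before the slot, filler matches the slot
# itself, post-filler matches text just after it.
pre_filler  = r'Book\s+Name\s*:?\s*</b>\s*'
filler      = r'(?P<title>[^<]+)'
post_filler = r'\s*<b>'

rule = re.compile(pre_filler + filler + post_filler)
text = "Book Name</b> The Pragmatic Programmer <b>Author</b> ..."
m = rule.search(text)
if m:
    print(m.group("title").strip())   # The Pragmatic Programmer
```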
2.2.3 WHISK
WHISK uses a covering learning algorithm to generate multi-slot extraction rules for a wide variety of documents, ranging from structured to free text. When applied to free text, WHISK works best with input that has been annotated by a syntactic analyzer and a semantic tagger. WHISK rules are based on a form of regular expression patterns that identify the context of relevant phrases. For structured or semi-structured text, the text is broken into multiple instances based on HTML tags or other regular expressions. For free text, a sentence analyzer segments the text into instances, where each instance is a clause, a sentence or a sentence fragment. Further pre-processing may automatically add semantic tags or syntactic annotations. WHISK begins with untagged instances and an empty set of tagged training instances. At each iteration, a set of untagged instances is selected and presented to the user to annotate. WHISK creates a rule from a seed instance as shown in Figure 2.3.
Figure 2.3 Creating rule from seed instance
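A WHISK-style multi-slot rule over a semi-structured instance can be approximated with a single regular expression; the rental-ad example below (in the spirit of WHISK's classic domain) is illustrative, and the pattern is not WHISK's actual syntax:

```python
import re

# One pattern extracts three slots (neighbourhood, bedrooms, price)
# from a rental-ad instance.
rule = re.compile(r'([A-Za-z ]+?)\s*-\s*(\d+)\s*br.*?\$(\d+)', re.IGNORECASE)
instance = "Capitol Hill - 2 br apartment, nice view, $675 per month"
m = rule.search(instance)
if m:
    neighbourhood, bedrooms, price = m.groups()
    print(neighbourhood, bedrooms, price)   # Capitol Hill 2 675
```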
2.3 Semi-Supervised IE Systems
2.3.1 IEPAD
IEPAD discovers repetitive patterns in a page and applies a multiple string alignment algorithm to align multiple strings, which start from each occurrence of a repeat and end before the start of the next occurrence.
2.3.2 OLERA
OLERA acquires a rough example from the user for extraction rule generation. It can learn extraction rules for pages containing single data records. OLERA consists of three main operations:
(1) Enclosing an information block of interest: the user marks an information block containing a record to be extracted, and OLERA discovers other similar blocks and generalizes them into an extraction pattern (using a multiple string alignment technique).
(2) Drilling down/rolling up an information slot: drilling down allows the user to navigate from a text fragment to more detailed components, whereas rolling up combines several slots to form a meaningful information unit.
(3) Designating relevant information slots for schema specification, as in IEPAD.
2.4 Unsupervised IE Systems
2.4.1 RoadRunner
RoadRunner views page generation as the encoding of database content into HTML; therefore, generating a wrapper for a set of HTML pages corresponds to inferring a grammar for the HTML code. The system uses the ACME (Align, Collapse under Mismatch, and Extract) matching technique to compare HTML pages of the same class and generate a wrapper based on their similarities and differences. It starts by comparing two pages, using the ACME technique to align the matched tokens and collapse the mismatched ones. There are two kinds of mismatches: string mismatches, which are used to discover attributes (#PCDATA), and tag mismatches, which are used to discover iterators (+) and optionals (?). Figure 2.4 shows an example of matching the first two pages of the running example and the generated wrapper. To reduce complexity, RoadRunner adopts UFRE (union-free regular expressions).
Figure 2.4 Matching the first two pages of the running example
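A toy Python sketch of the matching idea described above is shown below; it assumes equally long token lists and only distinguishes string mismatches from tag mismatches, whereas the real ACME algorithm also infers iterators and optionals:

```python
# Align two token lists: a text/text mismatch becomes a #PCDATA field,
# while other differences are flagged for operators such as (?) or (+).
def is_tag(tok):
    return tok.startswith("<")

def match(page_a, page_b):
    wrapper = []
    for a, b in zip(page_a, page_b):
        if a == b:
            wrapper.append(a)                 # tokens agree: keep literal
        elif not is_tag(a) and not is_tag(b):
            wrapper.append("#PCDATA")         # string mismatch: a data field
        else:
            wrapper.append("(?/+ mismatch)")  # tag mismatch: optional/iterator
    return wrapper

p1 = ["<html>", "<b>", "John Smith", "</b>", "</html>"]
p2 = ["<html>", "<b>", "Paul Jones", "</b>", "</html>"]
print(match(p1, p2))
# ['<html>', '<b>', '#PCDATA', '</b>', '</html>']
```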
2.4.2 DEPTA
DEPTA first builds a tag tree for the input page. It then mines data regions by comparing the tag strings of sibling nodes, e.g., the comparisons (b1, b2) and (b2, ol) under the parent node <body>, where the tag string of node <ol> is represented by "<li><b><b><b><li><b><b><b>". If the similarity is greater than a predefined threshold (as shown in the shaded nodes in Figure 2.5), the nodes are recorded as data regions. The third step is designed to handle situations when a data record is not rendered contiguously, as assumed in previous works.
Figure 2.5 The tag tree (left) and the DOM tree (as a comparison)
Finally, the recognition of data items or attributes in a record is accomplished by partial tree alignment [8]
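The similarity test over tag strings can be sketched with a normalized edit distance, as below; the threshold value and the sample regions are illustrative, not the ones DEPTA prescribes:

```python
# A sketch of a similarity test over tag strings using normalized
# edit distance (dynamic programming over two token lists).
def edit_distance(a, b):
    dp = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
          for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(dp[i-1][j] + 1, dp[i][j-1] + 1,
                           dp[i-1][j-1] + (a[i-1] != b[j-1]))
    return dp[len(a)][len(b)]

def similar(tags_a, tags_b, threshold=0.7):
    dist = edit_distance(tags_a, tags_b)
    return 1 - dist / max(len(tags_a), len(tags_b)) >= threshold

region_1 = ["li", "b", "b", "b"]
region_2 = ["li", "b", "b"]
print(similar(region_1, region_2))   # True: likely the same data region
```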
Chapter 3
Our Approach

3.1 Problem Formalization
The input reports are provided in both HTML and plain-text modes. To address this difficulty, we divide our problem into two tasks:
- Extract from HTML mode
- Extract from plain-text mode
In each task we apply different techniques to solve the problem, which are discussed later on.
3.1.1 HTML Mode
In this mode, the input contains a lot of information wrapped in HTML tags, but we need only the main information and ignore everything else. The desired information is placed inside table tags (e.g., Figure 3.1).
Figure 3.1 (b) The input rendered in a browser

Given such an input, what should the output look like? To answer this question, see Figure 3.2: the pieces of information that should be extracted together form tuples such as:
<Common stocks; INFORMATION TECHNOLOGY; Cisco Systems; 24,704,300; 660,099>
<Common stocks; INFORMATION TECHNOLOGY; Microsoft Corp; 21,395,000; 605,906>
<Common stocks; INFORMATION TECHNOLOGY; Intel Corp; 18,267,000; 423,429>
< >
Figure 3.2 Extracted Information
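As a first approximation of the HTML-mode task, the sketch below pulls table rows into tuples using only the Python standard library; the table structure and sample cells are assumptions, and our actual approach is described in Section 3.2:

```python
from html.parser import HTMLParser

class TableTupleParser(HTMLParser):
    """Collect the text of each <td>/<th> cell and emit one tuple per <tr>."""
    def __init__(self):
        super().__init__()
        self.in_cell, self.row, self.rows = False, [], []
    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.in_cell = True
        elif tag == "tr":
            self.row = []
    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
        elif tag == "tr" and self.row:
            self.rows.append(tuple(self.row))
    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.row.append(data.strip())

parser = TableTupleParser()
parser.feed("<table><tr><td>Cisco Systems</td><td>24,704,300</td>"
            "<td>660,099</td></tr></table>")
print(parser.rows)   # [('Cisco Systems', '24,704,300', '660,099')]
```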
3.1.2 Plain-text Mode
As we described earlier, the HTML mode does not seem too difficult for extraction. The plain-text mode, however, is a big challenge for us. Compared with the HTML mode, there are no tags wrapping the entities (see Figure 3.3 below); that is to say, we cannot depend on such tags for extraction as in the HTML mode.

Figure 3.3 The input of plain-text mode

Worse, files in this mode are often written carelessly, using both white space and tabs together to separate entities. This makes it more complicated to tokenize the entities.
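For example, one heuristic is to split on tabs or on runs of two or more spaces, so that multi-word entity names survive; this is a sketch with made-up data, not our final tokenizer:

```python
import re

# Entities are separated by an unpredictable mix of spaces and tabs,
# so split on a tab (plus any trailing whitespace) or on runs of two
# or more spaces; single spaces inside a name are preserved.
line = "Cisco Systems\t 24,704,300   660,099"
tokens = re.split(r'\t+\s*|\s{2,}', line.strip())
print(tokens)   # ['Cisco Systems', '24,704,300', '660,099']
```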
After examining several reports, we formulate the problem into a standard template for both modes. The template should be annotated with the tags below: