Information Extraction for Financial Analysis
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

ENTRY FOR THE "STUDENT SCIENTIFIC RESEARCH" AWARD, 2012

Title: Information Extraction for Financial Analysis
Student: Lê Văn Khánh (Male)
Supervisor: Dr. Phạm Bảo Sơn

HANOI - 2012
Abstract

Today, a great deal of useful information on the World Wide Web is formatted for human readers, which makes it difficult to extract relevant data from the various sources. Information Extraction (IE) was born to solve this problem. Flexible IE systems that transform information resources into program-friendly structures, such as relational databases or XML, are becoming a great necessity. In this report, we present the problem of applying Information Extraction to Financial Analysis. The main goal is to extract information from thousands of financial reports written in different formats. We also present a systematic approach to building recognition rules, and we evaluate the performance of our system.
Contents

Chapter 1: Introduction
  1.1 Subject Overview
  1.2 Information Extraction
  1.3 Report Structure
Chapter 2: Approaches in Information Extraction
  2.1 Manually Constructed IE Systems
    2.1.1 TSIMMIS tool
    2.1.2 W4F
    2.1.3 XWRAP
  2.2 Supervised IE Systems
    2.2.1 SRV
    2.2.2 RAPIER
    2.2.3 WHISK
  2.3 Semi-Supervised IE Systems
    2.3.1 IEPAD
    2.3.2 OLERA
  2.4 Unsupervised IE Systems
    2.4.1 RoadRunner
    2.4.2 DEPTA
Chapter 3: Our Approach
  3.1 Problem Formalization
    3.1.1 HTML Mode
    3.1.2 Plain-text Mode
  3.2 Approaches & Implementation
    3.2.1 HTML Mode
      3.2.1.1 Preprocessing
      3.2.1.2 Extracting
      3.2.1.3 Finalizing
Chapter 4: Experimental Setup and Evaluations
  4.1 Evaluation Metrics
  4.2 Corpus Development
  4.3 Evaluation Criteria
  4.4 Training Process
  4.5 Testing Process
Chapter 5: Conclusion & Future Work
References
Chapter 1
Introduction

1.1 Subject Overview
These reports contain a great deal of information; however, people need only brief but essential information in order to quickly understand what is reported in them. For example, given a document like Figure 1.1, the system produces output as in Figure 1.2. Figure 1.3 shows the general scenario of our work.

Figure 1.1 A report
Figure 1.2 Excel format
Figure 1.3 Scenarios

In this scenario (Figure 1.3), we concentrate only on step 1: processing reports to produce output like Figure 1.2; at step 2, financial experts then analyze these outputs. To sum up, our task is to apply Information Extraction for Financial Analysis to produce such output (e.g., Figure 1.2 above).
1.2 Information Extraction
Information extraction (IE) was originally applied to identify desired information in natural language text and convert it into a well-defined structure, e.g., a database with particular fields. With the huge and rapidly increasing amount of information sources and electronic documents available on the World Wide Web, information extraction has been extended to the identification of data in structured and semi-structured web pages. Recently, more and more research groups have concentrated on the development of information extraction systems for areas such as web mining and question answering. Research on information extraction can be divided into two subareas: the extraction patterns used for identification of target information in a given text, and machine learning techniques that automatically build such extraction patterns in order to avoid expensive construction by hand. Many information extraction systems have been successfully implemented, and some of them perform very well. To be specific, Figure 1.4 shows an example of information extraction: given a document of a seminar announcement, the entities Date, Start-time, Location, Speaker and Topic can be identified.
Figure 1.4 Information Extraction for a Seminar Announcement

Formally, an IE task is defined by its input and its extraction target. The input can be unstructured documents, such as plain text written in natural language (e.g., Figure 1.4), or semi-structured documents that are popular on the Web, such as tables or itemized and enumerated lists (e.g., Figure 1.5).

Figure 1.5 A semi-structured page containing data records (in rectangular boxes) to be extracted
The extraction target of an IE task can be a relation of k-tuples (where k is the number of fields/attributes in a record), or a complex object with hierarchically organized data. For some IE tasks, an attribute may have zero (missing) or multiple instantiations in a record. IE systems are also called extractors or wrappers.
Traditional IE systems mainly use rule-based, machine learning and pattern mining techniques to exploit the information.
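For illustration, a k-tuple record with a missing attribute and a multiply instantiated attribute might be represented as in the Python sketch below; the task and field names are hypothetical, not from our system:

```python
# An illustrative 4-tuple record (k = 4) for a book-listing IE task.
# Attributes may be missing (None) or have multiple instantiations (a list).
record = {
    "title":   "Data on the Web",
    "authors": ["Abiteboul", "Buneman", "Suciu"],  # multiple instantiations
    "price":   "39.95",
    "isbn":    None,                               # missing attribute
}
```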
1.3 Report Structure
Our report is organized as follows. In Chapter 2, we introduce IE systems in the information extraction domain and review some of the solutions that have been proposed. In Chapter 3, we describe our approach and system implementation. Chapter 4 describes the experiments we carried out to evaluate the quality of our approach. Finally, Chapter 5 presents our conclusion and future work.
Chapter 2
Approaches in Information Extraction
Earlier IE systems were designed to facilitate programmers in writing extraction rules, while later IE systems apply machine learning to generate generalized rules automatically. Such systems offer different degrees of automation and accuracy. Therefore, IE systems can be classified into four classes: manually constructed IE systems, supervised IE systems, semi-supervised IE systems and unsupervised IE systems.
2.1 Manually Constructed IE Systems
In manually constructed IE systems, users create a wrapper for each input source by hand, using general programming languages such as Java, Python or Perl, or using specially designed languages. These tools require developers with substantial computer and programming backgrounds, which makes them expensive. Such systems include TSIMMIS [1], W4F [2] and XWRAP [3].
2.1.1 TSIMMIS tool
The main component of this tool is a wrapper that takes as input a specification file that declaratively states where the data of interest is located and how it should be packaged into objects. For example, Figure 2.1(a) shows such a specification file.
Figure 2.1 (a) A TSIMMIS specification file and (b) the OEM output
Each command is of the form [variables, source, pattern], where source specifies the input text to be considered, pattern specifies how to find the text of interest within the source, and variables are a list of variables that hold the extracted results. The special symbol "*" in a pattern means discard, and "#" means save into the variables. TSIMMIS then outputs data in the Object Exchange Model (e.g., Figure 2.1(b)), which contains the extracted data together with information about the structure and contents of the result.
TSIMMIS provides two important operators: split and case. The split operator divides an input list element into individual elements, and the case operator allows the user to handle irregularities in the structure of the input pages.
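To make the command semantics concrete, the following is a minimal Python sketch of how such a [variables, source, pattern] command could be interpreted; it is an illustration only, not TSIMMIS's actual implementation, and the sample page is made up:

```python
import re

def apply_command(source_text, pattern):
    """Interpret a TSIMMIS-style pattern: '*' means discard the
    matched text, '#' means save it into the output variables."""
    regex_parts = []
    for piece in re.split(r'([*#])', pattern):
        if piece == '*':
            regex_parts.append(r'.*?')            # match and discard
        elif piece == '#':
            regex_parts.append(r'(.*?)')          # match and capture
        else:
            regex_parts.append(re.escape(piece))  # literal text
    match = re.search(''.join(regex_parts), source_text, re.DOTALL)
    return list(match.groups()) if match else []

# Hypothetical command [root, page, "*<TITLE>#</TITLE>*"]:
page = "<HTML><TITLE>Quarterly Report</TITLE><BODY>...</BODY></HTML>"
print(apply_command(page, "*<TITLE>#</TITLE>*"))  # ['Quarterly Report']
```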
2.1.2 W4F
W4F stands for World Wide Web Wrapper Factory, a Java toolkit for generating Web wrappers. The wrapper development process consists of three independent layers: retrieval, extraction and mapping. In the retrieval layer, a document is retrieved from the Web through the HTTP protocol, cleaned up and then parsed into a tree following the Document Object Model (DOM) [5]. In the extraction layer, extraction rules are applied to the DOM tree to extract information, which is then stored in an internal format called the Nested String List (NSL). In the mapping layer, the NSL structures are exported to the upper-level application according to mapping rules. Extraction rules are expressed in HEL (HTML Extraction Language), which uses HTML parse tree (i.e., DOM tree) paths to address the data to be located. For example, users can use regular expressions to match or split (following the programming language syntax) the string obtained by a DOM tree path.
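The three-layer flow can be mirrored in a short Python sketch; note that this uses the standard library rather than W4F's HEL, and the extraction target (the page title) is only an example:

```python
# A minimal Python analogue of W4F's three layers (HEL itself is not
# shown; this only mirrors the retrieval -> extraction -> mapping flow).
import urllib.request
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Extraction layer: walk the parse events and collect <title> text."""
    def __init__(self):
        super().__init__()
        self.in_title, self.titles = False, []
    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

def wrap(url):
    html = urllib.request.urlopen(url).read().decode("utf-8")  # retrieval layer
    parser = TitleExtractor()
    parser.feed(html)                                          # extraction layer
    return {"title": parser.titles}                            # mapping layer
```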
2.1.3 XWRAP
algorithm is used here.
2.2 Supervised IE Systems
Supervised IE systems take a set of input pages labeled with examples of the data to be extracted and output a wrapper. The user provides an initial set of annotated examples to train the system. For such systems, general users rather than programmers can be trained to use the labeling GUI, thus reducing the cost of wrapper generation. Such systems include SRV [4], RAPIER [6] and WHISK [12].
2.2.1 SRV
SRV is a top-down relational algorithm that generates single-slot extraction rules. The input documents are tokenized, and all substrings of continuous tokens are labeled as either extraction targets or not. The rules generated by SRV are logic rules that rely on a set of token-oriented features, which can be either simple or relational. A simple feature is a function that maps a token to some discrete value such as length, character type (e.g., numeric), orthography (e.g., capitalized) or part of speech (e.g., verb). A relational feature maps a token to another token, e.g., the contextual (previous or next) tokens of the input token.
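The following sketch illustrates the distinction between simple and relational token features in Python; the feature names and the sample sentence are our own, not SRV's syntax:

```python
# Simple features map a token to a discrete value; relational features
# map a token to another token, such as its neighbours.
def length(token):          return len(token)
def is_numeric(token):      return token.isdigit()
def is_capitalized(token):  return token[:1].isupper()

def prev_token(tokens, i):  return tokens[i - 1] if i > 0 else None
def next_token(tokens, i):  return tokens[i + 1] if i < len(tokens) - 1 else None

tokens = "Seminar starts at 3 pm in Room 125".split()
i = tokens.index("3")
print(is_numeric(tokens[i]), prev_token(tokens, i), next_token(tokens, i))
# True at pm
```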
2.2.2 RAPIER
RAPIER also focuses on field-level extraction but uses a bottom-up (compression-based) relational learning algorithm: it begins with the most specific rules and then replaces them with more general ones. RAPIER learns single-slot extraction patterns that make use of syntactic and semantic information, including a part-of-speech tagger and a lexicon (WordNet). It also uses templates to learn extraction patterns. The extraction rules contain three distinct patterns: the pre-filler pattern matches the text immediately preceding the filler, the second pattern matches the actual slot filler, and the post-filler pattern matches the text immediately following the filler. As an example, Figure 2.2 shows the extraction rule for a book title, which is immediately preceded by the words "Book", "Name" and "</b>", and immediately followed by the word "<b>".
Figure 2.2 RAPIER extraction rule
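In the spirit of Figure 2.2, such a rule can be approximated as three concatenated regular expressions; this is a simplified Python illustration with invented sample text, not RAPIER's actual rule language:

```python
import re

# Pre-filler matches text just before the slot, filler matches the slot
# itself, post-filler matches text just after it.
pre_filler  = r'Book\s+Name\s*:?\s*</b>\s*'
filler      = r'(?P<title>[^<]+)'
post_filler = r'\s*<b>'

rule = re.compile(pre_filler + filler + post_filler)
text = "Book Name</b> The Pragmatic Programmer <b>Author</b> ..."
m = rule.search(text)
if m:
    print(m.group("title").strip())   # The Pragmatic Programmer
```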
2.2.3 WHISK
WHISK uses a covering learning algorithm to generate multi-slot extraction rules for a wide variety of documents, ranging from structured to free text. When applied to free text, WHISK works best with input that has been annotated by a syntactic analyzer and a semantic tagger. WHISK rules are based on a form of regular expression patterns that identify the context of relevant phrases. For structured or semi-structured text, the text is broken into multiple instances based on HTML tags or other regular expressions. For free text, a sentence analyzer segments the text into instances, where each instance is a clause, a sentence or a sentence fragment. Further pre-processing may automatically add semantic tags or syntactic annotations. WHISK begins with untagged instances and an empty set of tagged training instances. At each iteration, a set of untagged instances is selected and presented to the user to annotate. WHISK creates a rule from a seed instance as shown in Figure 2.3.
Figure 2.3 Creating rule from seed instance
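A WHISK-style multi-slot rule over a semi-structured instance can be approximated with a single regular expression; the rental-ad example below (in the spirit of WHISK's classic domain) is illustrative, and the pattern is not WHISK's actual syntax:

```python
import re

# One pattern extracts three slots (neighbourhood, bedrooms, price)
# from a rental-ad instance.
rule = re.compile(r'([A-Za-z ]+?)\s*-\s*(\d+)\s*br.*?\$(\d+)', re.IGNORECASE)
instance = "Capitol Hill - 2 br apartment, nice view, $675 per month"
m = rule.search(instance)
if m:
    neighbourhood, bedrooms, price = m.groups()
    print(neighbourhood, bedrooms, price)   # Capitol Hill 2 675
```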
2.3 Semi-Supervised IE Systems
2.3.1 IEPAD
IEPAD discovers repetitive patterns in a page and applies a multiple string alignment algorithm to align multiple strings, which start from each occurrence of a repeat and end before the start of the next occurrence.
2.3.2 OLERA
OLERA acquires a rough example from the user for extraction rule generation. It can learn extraction rules for pages containing single data records. OLERA consists of three main operations:
(1) Enclosing an information block of interest: the user marks an information block containing a record to be extracted, and OLERA discovers other similar blocks and generalizes them into an extraction pattern (using a multiple string alignment technique).
(2) Drilling down/rolling up an information slot: drilling down allows the user to navigate from a text fragment to more detailed components, whereas rolling up combines several slots to form a meaningful information unit.
(3) Designating relevant information slots for schema specification, as in IEPAD.
2.4 Unsupervised IE Systems
2.4.1 RoadRunner
RoadRunner views page generation as the encoding of database content into HTML; therefore, generating a wrapper for a set of HTML pages corresponds to inferring a grammar for the HTML code. The system uses the ACME (Align, Collapse under Mismatch, and Extract) matching technique to compare HTML pages of the same class and generate a wrapper based on their similarities and differences. It starts by comparing two pages, using the ACME technique to align the matched tokens and collapse the mismatched ones. There are two kinds of mismatches: string mismatches, which are used to discover attributes (#PCDATA), and tag mismatches, which are used to discover iterators (+) and optionals (?). Figure 2.4 shows an example of matching the first two pages of the running example and the generated wrapper. To reduce complexity, RoadRunner adopts UFRE (union-free regular expressions).
Figure 2.4 Matching the first two pages of the running example
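A toy Python sketch of the matching idea described above is shown below; it assumes equally long token lists and only distinguishes string mismatches from tag mismatches, whereas the real ACME algorithm also infers iterators and optionals:

```python
# Align two token lists: a text/text mismatch becomes a #PCDATA field,
# while other differences are flagged for operators such as (?) or (+).
def is_tag(tok):
    return tok.startswith("<")

def match(page_a, page_b):
    wrapper = []
    for a, b in zip(page_a, page_b):
        if a == b:
            wrapper.append(a)                 # tokens agree: keep literal
        elif not is_tag(a) and not is_tag(b):
            wrapper.append("#PCDATA")         # string mismatch: a data field
        else:
            wrapper.append("(?/+ mismatch)")  # tag mismatch: optional/iterator
    return wrapper

p1 = ["<html>", "<b>", "John Smith", "</b>", "</html>"]
p2 = ["<html>", "<b>", "Paul Jones", "</b>", "</html>"]
print(match(p1, p2))
# ['<html>', '<b>', '#PCDATA', '</b>', '</html>']
```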
2.4.2 DEPTA
DEPTA first builds a tag tree for the input page. It then mines data regions by comparing the tag strings of sibling nodes, e.g., the comparisons (b1, b2) and (b2, ol) under the parent node <body>, where the tag string of node <ol> is represented by "<li><b><b><b><li><b><b><b>". If the similarity is greater than a predefined threshold (as shown in the shaded nodes in Figure 2.5), the nodes are recorded as data regions. The third step is designed to handle situations when a data record is not rendered contiguously, as assumed in previous works.
Figure 2.5 The tag tree (left) and the DOM tree (as a comparison)
Finally, the recognition of data items or attributes in a record is accomplished by partial tree alignment [8]
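The similarity test over tag strings can be sketched with a normalized edit distance, as below; the threshold value and the sample regions are illustrative, not the ones DEPTA prescribes:

```python
# A sketch of a similarity test over tag strings using normalized
# edit distance (dynamic programming over two token lists).
def edit_distance(a, b):
    dp = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
          for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(dp[i-1][j] + 1, dp[i][j-1] + 1,
                           dp[i-1][j-1] + (a[i-1] != b[j-1]))
    return dp[len(a)][len(b)]

def similar(tags_a, tags_b, threshold=0.7):
    dist = edit_distance(tags_a, tags_b)
    return 1 - dist / max(len(tags_a), len(tags_b)) >= threshold

region_1 = ["li", "b", "b", "b"]
region_2 = ["li", "b", "b"]
print(similar(region_1, region_2))   # True: likely the same data region
```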
Chapter 3
Our Approach

3.1 Problem Formalization
The input reports are provided in both HTML and plain-text modes. To address this difficulty, we divide our problem into two tasks:
- Extract from HTML mode
- Extract from plain-text mode
In each task we apply different techniques to solve the problem, which are discussed later on.
3.1.1 HTML Mode
In this mode, the input contains a lot of information wrapped in HTML tags, but we need only the main information and ignore everything else. The desired information is placed inside table tags (e.g., Figure 3.1).
Figure 3.1 (b) The input rendered in a browser

Given such an input, what should the output look like? To answer this question, see Figure 3.2: the pieces of information that should be extracted together form tuples such as:
<Common stocks; INFORMATION TECHNOLOGY; Cisco Systems; 24,704,300; 660,099>
<Common stocks; INFORMATION TECHNOLOGY; Microsoft Corp; 21,395,000; 605,906>
<Common stocks; INFORMATION TECHNOLOGY; Intel Corp; 18,267,000; 423,429>
< >
Figure 3.2 Extracted Information
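As a first approximation of the HTML-mode task, the sketch below pulls table rows into tuples using only the Python standard library; the table structure and sample cells are assumptions, and our actual approach is described in Section 3.2:

```python
from html.parser import HTMLParser

class TableTupleParser(HTMLParser):
    """Collect the text of each <td>/<th> cell and emit one tuple per <tr>."""
    def __init__(self):
        super().__init__()
        self.in_cell, self.row, self.rows = False, [], []
    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.in_cell = True
        elif tag == "tr":
            self.row = []
    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
        elif tag == "tr" and self.row:
            self.rows.append(tuple(self.row))
    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.row.append(data.strip())

parser = TableTupleParser()
parser.feed("<table><tr><td>Cisco Systems</td><td>24,704,300</td>"
            "<td>660,099</td></tr></table>")
print(parser.rows)   # [('Cisco Systems', '24,704,300', '660,099')]
```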
3.1.2 Plain-text Mode
As we described earlier, the HTML mode does not seem too difficult for extraction. The plain-text mode, however, is a big challenge for us. Compared with the HTML mode, there are no tags wrapping the entities (see Figure 3.3 below); that is to say, we cannot depend on such tags for extraction as in the HTML mode.

Figure 3.3 The input of plain-text mode

Worse, files in this mode are often written carelessly, using both white space and tabs together to separate entities. This makes it more complicated to tokenize the entities.
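For example, one heuristic is to split on tabs or on runs of two or more spaces, so that multi-word entity names survive; this is a sketch with made-up data, not our final tokenizer:

```python
import re

# Entities are separated by an unpredictable mix of spaces and tabs,
# so split on a tab (plus any trailing whitespace) or on runs of two
# or more spaces; single spaces inside a name are preserved.
line = "Cisco Systems\t 24,704,300   660,099"
tokens = re.split(r'\t+\s*|\s{2,}', line.strip())
print(tokens)   # ['Cisco Systems', '24,704,300', '660,099']
```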
After examining several reports, we formulate the problem into a standard template for both modes. The template should be annotated with the tags below: