
[Figure 1 screenshot: the GATE Unicode Editor displaying sample text in Arabic, Armenian, Chinese, Farsi/Persian, Georgian, Hebrew, Hindi, Japanese, Korean, Marathi, Pashto, Sanskrit, Thai, Urdu and Yiddish]

Multilingual adaptations of ANNIE, a reusable information extraction tool

Diana Maynard, Hamish Cunningham

Dept of Computer Science
University of Sheffield
Sheffield, S1 4DP, UK

diana@dcs.shef.ac.uk

Abstract

In this demo we will present GATE, an architecture and framework for language engineering, and ANNIE, an information extraction system developed within it. We will demonstrate how ANNIE has been adapted to perform NE recognition in different languages, including Indic and Slavonic languages as well as Western European ones, and how the resources can be reused for new applications and languages.

1 Introduction

GATE¹ is an architecture, development environment, and framework for building systems that process human language (Cunningham et al., 2002; Maynard et al., 2002). It has been in development at the University of Sheffield since 1995, and has been used for many R&D projects, including Information Extraction in multiple languages and media, and for multiple tasks and clients. GATE is available freely, as an open source system, under the GNU library licence, and has been downloaded by around 2500 sites worldwide. The core architecture and some applications developed within GATE have been previously demonstrated (Cunningham et al., 2002); however, this demonstration will focus on the multilingual aspects of GATE, and adaptations of its IE system for different languages.

Version 2 of GATE has a large number of added features from the previous version, such as:

• comprehensive multilingual support via Unicode

• tools for performance evaluation

• support for manual annotation

• reusable visualisation components

• database support (Oracle, PostgreSQL)

• support for distributed resources from the Web

• comprehensive document format support (SGML, XML, HTML, RDF, email, plain text)

¹ This work has been supported by the Engineering and Physical Sciences Research Council (EPSRC) under grants GR/K25267 and GR/M31699, and by several smaller grants.

Figure 1: Unicode text in GATE 2

2 Processing Resources

GATE provides a baseline set of reusable and extendable language processing components for common NLP tasks, known collectively as ANNIE (A Nearly New Information Extraction System). These include a Unicode tokeniser, sentence splitter, POS tagger, gazetteer, semantic tagger, name coreferencer (orthomatcher) and pronominal coreferencer. For more details, see (Cunningham et al., 2002). ANNIE currently produces precision and recall figures for named entity recognition of around 90%, depending on the text type.
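The idea of chaining processing resources over a shared document can be illustrated with a minimal Python sketch. This is not GATE's actual (Java) API: the `Annotation` and `Document` classes, the function names, the toy gazetteer entries and the feature values are all illustrative assumptions. Each resource simply reads the document and adds stand-off annotations, in the spirit of the Token and Lookup annotations that a tokeniser and a gazetteer produce.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str
    start: int
    end: int
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)


def tokenise(doc):
    """Add one Token annotation per run of word characters (Unicode-aware)."""
    for m in re.finditer(r"\w+", doc.text):
        doc.annotations.append(Annotation("Token", m.start(), m.end()))


def make_gazetteer(entries):
    """Return a resource that adds a Lookup annotation for each listed token."""
    def gazetteer(doc):
        # Copy the token list first, because we append while processing.
        for tok in [a for a in doc.annotations if a.type == "Token"]:
            word = doc.text[tok.start:tok.end]
            if word in entries:
                doc.annotations.append(
                    Annotation("Lookup", tok.start, tok.end, dict(entries[word])))
    return gazetteer


def run_pipeline(doc, resources):
    """Apply each processing resource to the document in order."""
    for resource in resources:
        resource(doc)
    return doc


# Hypothetical gazetteer lists; the linguistic data is separate from the code.
TOY_ENTRIES = {"Sheffield": {"majorType": "location", "minorType": "city"},
               "Bulgaria": {"majorType": "location", "minorType": "country"}}

doc = run_pipeline(Document("Diana works in Sheffield, not in Bulgaria."),
                   [tokenise, make_gazetteer(TOY_ENTRIES)])
lookups = [(doc.text[a.start:a.end], a.features["minorType"])
           for a in doc.annotations if a.type == "Lookup"]
print(lookups)
```

Because the gazetteer entries are passed in as data, adapting this toy pipeline to another language would mean swapping the lists rather than the code, which mirrors the reuse argument made in this paper.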

An online demo of ANNIE is available at http://gate.ac.uk/annie/index.jsp. A set of movies demonstrating document and corpus loading, processing and storing, manual annotation of documents and corpora, creating, running, saving and restoring applications and viewing their results is available at http://www.gate.ac.uk/demos/movies.html.

3 Multilingual support - the GATE Unicode Kit

GATE is one of the few architectures to support multilingual processing, using Unicode as its default text encoding. A Unicode-enabled [...] two main issues: the capability to display text and the ability to enter text in languages other than the default one.

It also provides a means of entering text in a variety of languages and scripts, using virtual keyboards where the language is not supported by the underlying operating platform. (Java itself does not support input in many languages covered by Unicode, although it supports Unicode representation.) Figure 1 depicts text in various scripts displayed in GATE. The facilities have been developed as part of the EMILLE project (Baker et al., 2002), designed to construct a 63 million word corpus of South Asian languages. There are currently 28 languages supported in GATE, and more are planned for the future. Since GATE is an open architecture, new virtual keyboards can easily be defined by users and added to the system.

Apart from the input methods, the GATE Unicode Kit (GUK) also provides a simple Unicode-aware text editor, which is important because not all platforms provide one by default, or the users may not know which of the already installed editors is Unicode-aware. Besides providing text visualisation and editing facilities, the GUK editor also performs encoding conversion operations. The editor has proved a useful tool during the development and testing of GATE in a cross-platform environment, while the ability to handle Unicode enables applications developed within GATE to be easily ported to new languages.
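As an illustration of what an encoding conversion operation involves, here is a minimal Python sketch using only the standard library. It does not reproduce the GUK editor's implementation; the function name, the sample word and the choice of windows-1251 as the legacy encoding are assumptions made for the example.

```python
def convert_encoding(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Decode bytes from a legacy encoding and re-encode them (UTF-8 by default)."""
    return data.decode(source).encode(target)


# Bulgarian sample word stored in a legacy single-byte Cyrillic encoding.
original = "София"
legacy_bytes = original.encode("windows-1251")

utf8_bytes = convert_encoding(legacy_bytes, "windows-1251")

# One byte per letter in the legacy encoding, two bytes per letter in UTF-8.
print(len(legacy_bytes), len(utf8_bytes))
# The text itself is unchanged by the round trip.
print(utf8_bytes.decode("utf-8") == original)
```

Converting everything to a single Unicode encoding on the way in is what lets the later processing steps ignore where a text originally came from.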

4 The future isn't English

Robust tools for multilingual information extraction are becoming increasingly sought after now that we have capabilities for processing texts in different languages and scripts. While the default IE system is English-specific, some of the modules can be reused directly (e.g. the Unicode-based tokeniser can handle Indo-European languages), and/or easily customised for new languages (Pastra et al., 2002). So far, ANNIE has been adapted to do IE in Bulgarian, Romanian, Bengali, Greek, Spanish, Swedish, German, Italian, and French, and we are currently porting it to Arabic, Chinese and [...]

Figure 3: Romanian news text annotated in GATE

4.1 NE in Slavonic languages

The Bulgarian NE recogniser (Paskaleva et al., 2002) was built using three main processing resources: a tokeniser, a gazetteer and a semantic grammar built using JAPE. There was no POS tagger available in Bulgarian, and consequently we had no need of a sentence splitter either. The main changes to the system were in terms of the gazetteer lists (e.g. lists of first names, days of the week, locations etc. were tailored for Bulgarian), and in terms of some of the pattern matching rules in the grammar. For example, Bulgarian makes far more use of morphology than English does, e.g. 91% of Bulgarian surnames could be directly recognised using morphological information. The lack of a POS tagger meant that many rules had to be specified in terms of orthographic features rather than parts of speech. Figure 2 shows some Bulgarian text annotated in GATE.

² http://www.dcs.shef.ac.uk/research/groups/nlp/muse/

Figure 2: Bulgarian named entities in GATE (screenshot showing Lookup and Person annotations over a Bulgarian text)

Since the structure of the Bulgarian and Russian languages is quite similar, we anticipate that converting the Bulgarian system to Russian will be fairly straightforward, and will involve mostly replacing and/or updating gazetteer lists, at least to obtain comparable results.
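The kind of rule described in this section, combining a gazetteer of first names with orthographic and morphological cues for surnames, can be sketched in Python as follows. The actual system expresses its rules in JAPE, which is not shown here. The suffix list, the toy first-name list and the function names are illustrative assumptions, not the rules or lists used in the Bulgarian recogniser.

```python
import re

# Assumed, illustrative list of common Bulgarian surname endings.
SURNAME_SUFFIXES = ("ова", "ева", "ска", "ов", "ев", "ски")

# Toy gazetteer list of first names (hypothetical entries).
TOY_FIRST_NAMES = {"Калина", "Галя", "Иван"}


def looks_like_surname(token: str) -> bool:
    """Orthographic check (initial capital) plus a morphological suffix check."""
    return token[:1].isupper() and token.lower().endswith(SURNAME_SUFFIXES)


def find_persons(text: str):
    """Return (first name, surname) pairs: a listed first name followed by a
    surname-like token. No POS information is used."""
    tokens = re.findall(r"\w+", text)
    return [(a, b) for a, b in zip(tokens, tokens[1:])
            if a in TOY_FIRST_NAMES and looks_like_surname(b)]


print(find_persons("Вчера Калина Бончева и Иван Петров бяха в София."))
```

Under these assumptions, moving to a closely related language would mostly mean replacing the two data structures at the top, which is the sense in which the conversion to Russian is expected to centre on the gazetteer lists.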

4.2 NE in Romanian

The Romanian NE recogniser (Hamza et al., 2002) was developed from ANNIE in a similar way to the Bulgarian one, using a tokeniser, gazetteer and a JAPE semantic grammar. Figure 3 shows some Romanian text annotated in GATE.

Romanian is a more flexible language than English in terms of word order; it is also agglutinative, e.g. definite articles attach to nouns, making a definite and indefinite form of both common and proper nouns.

As with Bulgarian, the tokeniser did not need to be modified, while the gazetteer lists and [...] which were fairly minor. For both Bulgarian and Romanian, the modifications necessary were easily implemented by a native speaker who did not require any other specialist skills beyond a basic grasp of the JAPE language and the GATE architecture. No Java skills or other programming knowledge were necessary. The GATE Unicode Kit was invaluable in enabling the preservation of the diacritics in Romanian, by saving them with UTF-8 encoding.

4.3 NE in other languages

ANNIE has also been adapted to perform NE recognition on English, French and German [...] of which is shown in Figure 4. Since French and German are more similar to English in many ways than e.g. Slavonic languages, it was very easy to adapt the gazetteers and grammars accordingly.

³ http://www.dcs.shef.ac.uk/nlp/amities

AMITIES: Amities, how can I help, je vous écoute, was kann ich für Sie tun?

USER: Guten Tag, ich möchte gern wissen, wieviel ich auf meinem Konto habe

AMITIES: Darf ich bitte Ihre Kontonummer haben?

USER: Mon numéro est 5522333344945555

AMITIES: Pouvez-vous me donner votre adresse et votre date de naissance, s'il vous plaît?

USER: I live at 547 Ratchet Lane, and my date of birth is the 18th of May, 1969

AMITIES: Thank you. Here is the information about your account.

Figure 4: AMITIES multilingual dialogue

4.4 Surprise languages

We are currently investigating methods of adapting ANNIE to new languages with the minimum of resources and time. Our previous experiments with languages other than English have demonstrated that we can get reasonable results in around 2 person months using a native speaker and hand-coded semantic tagging rules, without requiring resources such as dictionaries or POS taggers for that language. We are also planning participation in the TIDES-based "surprise language experiment", which requires various NLP tasks such as IE, IR, summarisation and MT to be carried out in a month on a surprise language, the nature of which will not be known in advance. The open and flexible architecture of GATE, and the separation of linguistic data from processing, make it an ideal environment within which to perform such a task. Any available linguistic resources such as dictionaries and POS taggers can be simply plugged into the model, but if these are not available we can simply modify other components as necessary.
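The point about plugging in resources when they exist, and adapting other components when they do not, can be sketched as follows. This is a hedged Python illustration, not GATE code: the function names, the feature labels and the toy tag dictionary are hypothetical.

```python
from typing import Callable, Optional


def orthographic_category(token: str) -> str:
    """Fallback feature when no POS tagger exists for the language."""
    if token.isdigit():
        return "number"
    if token[:1].isupper():
        return "upperInitial"
    return "lowercase"


def describe_tokens(tokens, pos_tagger: Optional[Callable[[str], str]] = None):
    """Use the plugged-in POS tagger when available, else orthographic features."""
    feature = pos_tagger if pos_tagger is not None else orthographic_category
    return [(tok, feature(tok)) for tok in tokens]


tokens = ["Sheffield", "hosted", "2500", "visitors"]

# No tagger available for the language: fall back to orthography.
fallback = describe_tokens(tokens)
print(fallback)

# A toy, hypothetical tagger plugged in without changing the processing code.
toy_tags = {"Sheffield": "NNP", "hosted": "VBD", "2500": "CD", "visitors": "NNS"}
tagged = describe_tokens(tokens, pos_tagger=lambda t: toy_tags.get(t, "UNK"))
print(tagged)
```

Downstream rules written against the returned feature values would need one variant per feature set, which corresponds to modifying other components when a resource is missing.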

5 Conclusion

In this demo we have shown how an existing set of IE tools has been adapted to a diverse set of languages with minimum overhead. The advantage of having such low-overhead portability is that it enables quick deployment of IE tools with acceptable performance, which, even if not [...] to bootstrap the creation of IE-annotated corpora and/or facilitate the training of learning tools for adaptive IE. In addition, some adaptive IE tools are now using the ANNIE components to provide them with richer linguistic information (Ciravegna et al., 2002).

References

P. Baker, A. Hardie, T. McEnery, H. Cunningham, and R. Gaizauskas. 2002. EMILLE, A 67-Million Word Corpus of Indic Languages: Data Collection, Mark-up and Harmonisation. In Proceedings of 3rd Language Resources and Evaluation Conference (LREC'2002), pages 819-825.

F. Ciravegna, A. Dingli, D. Petrelli, and Y. Wilks. 2002. User-System Cooperation in Document Annotation Based on Information Extraction. In 13th International Conference on Knowledge Engineering and Knowledge Management (EKAW02), pages 122-137, Siguenza, Spain.

H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. 2002. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics.

O. Hamza, D. Maynard, V. Tablan, C. Ursu, H. Cunningham, and Y. Wilks. 2002. Named Entity Recognition in Romanian. Technical report, Department of Computer Science, University of Sheffield.

D. Maynard, [...], and Y. Wilks. 2002. Architectural elements of language engineering robustness. Journal of Natural Language Engineering - Special Issue on Robust Methods in Analysis of Natural Language Data, 8(2/3):257-274.

E. Paskaleva, [...], K. Bontcheva, H. Cunningham, and Y. Wilks. 2002. Slavonic named entities in GATE. Technical Report CS-02-01, University of Sheffield.

K. Pastra, D. Maynard, H. Cunningham, O. Hamza, and Y. Wilks. 2002. How feasible is the reuse of grammars for named entity recognition? In Proceedings of 3rd Language Resources and Evaluation Conference.
