Báo cáo khoa học: "The Corpora Management System Based on Java and Oracle Technologies" potx

The Corpora Management System Based on Java and Oracle Technologies Serge Yablonsky Petersburg Transport University, Computer Department, Moscow ay., 9, St.-Petersburg, 190031, Russia Ru

Trang 1

The Corpora Management System Based on Java and Oracle Technologies

Serge Yablonsky

Petersburg Transport University, Computer Department, Moscow ay., 9,

St.-Petersburg, 190031, Russia Russicon Company, Kazanskaya str., 56, ap.2, St.-Petersburg, 190000, Russia

serge yablonsky@hotmail.com ; root@russicon.spb.su;

http://www.russicon.ru

Abstract

The paper discusses the corpora

manage-ment system (CMS) design that uses Java

and Oracle9i DBMS to support strategic

corpora analysis We present the pilot

web-based CMS to support linguists in their

daily work The system offers facilities to

assist linguists and internet users as they

search for relevant material, and then

clas-sify and annotate this material

1 Introduction

There's a wide class of documental management

solutions and products that fall under the rubric

"corpora and text mining" They are similar to data

mining solutions in that they deal with large

vol-umes of data, but the difference between the two

technology solutions is that while data mining

ex-tracts, analyzes, and summarizes numerical,

struc-tured data, text mining handles large volumes of

unstructured, text-based data Document systems

with large-scale linguistic annotation are used by a

wide range of research and commercial

applica-tions

This paper presents a web-based text corpora

velopment system (CMS) that focuses on the

de-velopment of UML-specifications, architecture and

actual implementations of DBMS tools to support

strategic corpora analysis

We present the basic features of a prototype

cor-pora management system under development

in-tended to support linguists in their daily work The

system offers facilities to assist linguists and inter-net users as they search for relevant material, and then classify and annotate this material in a reposi-tory

The CMS is implemented using Java and commer-cial DBMS Oracle9i

2 System Overview

The Corpora management system combines Java, XML, XSL, HTML, and Oracle9i components (Yablonsky S.A., 2002) The system was by adapt-ing existadapt-ing and new DBMS and Java tools to the necessities of the intended task for the Russian language (Yablonsky S.A., 2000)

CMS consists of such main parts (see figure 1): Corpora — files in more then 150 different for-mats (doc, rtf, pdf, htm, xml, etc.);

Annotated corpora — files in XML format us-ing XML Corpus Encodus-ing Standard (XCES, http://www.xml-ces.org — Ide, N & Brew, C.,

2000, Ide, N., Romary L., 2001) and text for-mats;

Oracle 9i DBMS Enterprise Sever (Release 2) with such main counterparts:

o Grammatical dictionaries (inflec-tion paradigms of the given lan-guage) for the languages that are not supported by Oracle Text;

o Ontologies / WordNet (Fellbaum C.) / Domain Thesauruses for given languages;

o Word Index (index of all entry words or lemmas of Corpora);

Trang 2

Search Engine

Concept Clustering (Ontology Thesaurus) Document Management System Graphical Display and Navigation

DBMS Oracle 9iR2 ( -C-cirp -Cr— a (PC, Internet/intrar

- Ontology database

- Grammatical Dictionary XML, HTML, Doc, PDF etc.

Web-server (Apache 2.x + Tomcat 4.1.x)

Annotated Corpora

XML

41101rn et/I ntraneNP

Figure 1 CMS structure

O Oracle Text.

Web Server (for example, Apache 2.x plus

Tomcat 4.1.x) with such Java Server Pages

(JSP) counterparts:

O Search Engine;

o Concept Clustering;

o Document Management System;

o Interface Subsystem.

We present the fragment of CMS

UML-specification built by Rational Rose (Rational Rose

Enterprise Edition Documentation, 2001) that

could be expanded in future by community of

lan-guage and speech resources developers (see figure

2)

Here, for example, the relational table DOCS con-tains such attributes:

DOCS_NAME — document name;

DOCS_AUTHOR — document author;

DOCS_HTTP — document path;

DOCS_LANG — document language;

DOCS_LANG_CODE — document coding; DOCS_DATE — date of including in the Cor-pora;

DOCS_EXTENSION — document file exten-sion

The set of different types of UML-specifications brings us to full three-level system, including user, business and data services

Trang 3

DOC S_NO = DOCS_NO

<<R elat onaqalJe>>

WORD_INDEX

S NO P< _VVOR D VAR C H AR 2

#W_N EVV NUMBER IZ•V\I_VVORD_PK = VVI_WORD

<<F K>>

RESULTS D

<<FE>>

RESULTS QUERIES PC

ORY_NO = QRY_NO

«Relalionakiatle»

R ESULTS

•D OC S_N 0 NUMBER

R ES PAGE NO NU VEER DOCS_NO_P1 , ( = DOC S_N 0

OD 0 CS_

#00 CS_

#12)0 CS_

#00 CS_

#RUBRI

#00 CS

#D 0 CS_ LAN G_ CODE : VAR CHAR 2

#11) 0 CS_ DATE : DATE

tOCS_EXTEN SION : VAR CH AR 2

0 CS_ NANE_ PK = D 0 CS_ NAME

OCS_N0 _IN D = DOCS_N 0

NUMBER

NAIvE VARC HAP 2

AU TH OR VAR CH AR 2

HTTP VAR CHAR 2

CS_N 0 NUMBER

LANG VAR CHAR 2

<<R ationaiTabl e>>

RES_ADV_SEAR

#DOCS_N 0 NUMBER

A r#SEARCH_ID = SEAR C H_E)

<<RelationalTable>>

RU BR ICS

.1.RU BR ICS_N 0 : NUMBER

RU BR ICS_PAR ENT : NUMBER

#RU BR I CS_ N AME : VAR CH AR 2

tU BR ICS_DESC : VAR C HAP 2

U BR ICS_N AME_PK = RUBRICS_NAME UBRICS_N O_IND = RUBRICS_NO

<<R narraNe>>

QUERIES

#QP Y_N 0 NUMBER

#QP Y_QU ER Y VAR CHAR 2

#OR Y_R B1PrO NUMBER

#QR Y_EXTEN SION VAR C HAR 2

#OR Y_LANG VAR CH AR2

#QR Y_DATE VAR CHAR 2

#QR Y_IN CLUDE

1. 4. QUER ES_ Q UER Y_PE = OR Y_QUERY

4 r#QU ER ES_N O_IND = QR Y NO

<<R e I ati anal Tabl >>

DO C S

03

<<F1-4>>

DOGS RUBRICS N 0_ FK RUBRICS NO = RUBRICS_NO

REF ADV 3 ER

DOD S_NO = DOD S_NO

Figure 2 Fragment of UML notation of data model

UML specification of business services uses

stan-dard UML notations of stanstan-dard linguistic

annota-tion and corpora manipulaannota-tion procedures

Reusability of linguistic and corpora manipulation

business services could be achieved by usage of a

widely accepted set of UML notation standards for

corpus-based work in natural language processing

applications

3 System Features

The powerful search engine of prototype system

particular uses advantages of Oracle Text search

and services and includes (Oracle9i Database

Documentation):

• Content-based retrieval on free text with both

literal (word) predicates and thematic

predi-cates It includes: a comprehensive range of

operators and index preferences (e.g Boolean, exact phrase match, proximity, section search-ing, fuzzy, stemmsearch-ing, wildcard, thesaurus, stopwords, case sensitivity, and search scor-ing), "about" search, structured search, broad document format support and multi-language support For example, texts can be searched for stems of words e.g "teach" would return

"teaching", "taught" etc Fuzzy match can be used if you are not sure of the spelling of a word A search can be done to find words which are close to each other within a word Documents can be searched on what a docu-ment is about as opposed to the existence of specific words Gists or Theme Summaries can

be produced which produce a summary of what a document is about (using themes)

Trang 4

• For XML framework XPath searching enables

sophisticated queries which can reference and

leverage the embedded structure of XML

documents — instead of using a text query to

find documents, you use a document to fmd

queries XML path searching is able to

per-form sophisticated section searches: doctype

disambiguation, attribute value searching,

automatic section indexing, and more

In addition to the search capabilities, a number of

other features are provided to simplify application

development

• Corpora Format Support In order to index

documents stored in a variety of native

for-mats, such as Word, Excel, PowerPoint,

WordPerfect, HTML, and Acrobat/PDF,

sys-tem supplies a broad variety of "filters" that

al-low documents stored in their native formats to

be indexed Support for more then 150 file

for-mats in order to index files in a large range of

formats including Word, Acrobat, HTML,

WordPerfect, Powerpoint, Excel Flexible

Stor-age Location - documents can be stored and

indexed in the database, in a location pointed

to by a URL or in an external file

• Corpora Graphical Display and Navigation.

System services can convert any supported

document format to either plain text or

format-ted text (an HTML approximation retaining as

much as possible of the original formatting;

available for all formats except PDF) Both

plain text and HTML versions may be viewed

in a standard browser

• Document Management System System

sup-plies an administration tool through which all

major text maintenance and administration

functions may be performed

• Concept Clustering identifies the relationships

between phrases of the texts and it builds a

"lexical network," grouping related phrases

and enhancing the most important features of

these groupings The resulting patterns reveal

the conceptual backbone of the text collection

Russian WordNet is used for basic conceptual

grouping

• Automatic Corpora Text collection,

tokeniza-tion, part-of-speech tagging Reusability of

linguistic resources is achieved by annotation

of texts using a common data model For that

purpose the XML and related standards such as

XML Corpus Encoding Standard (XCES, http://www.xml-ces.org) (Ide, et al., 2000) are used in the system

CNS is localized for Russian language Oracle 9i Text doesn't have Russian language support In order to use Oracle Text capabilities we add Rus-sian Grammatical Dictionary (morphosyntactic dictionary) that consists of word paradigms with grammatical characteristics of all Corpora word index (Yablonsky S.A., 1998) It helps to perform linguistic (paradigm) search in Corpora and is also used for Corpora text annotation

System could be easily adapted to different soft-ware platforms (Java and Oracle) and the necessi-ties of other languages (Unicode), making the whole system portable to other platforms with minimal changes

The only language-specific resources are a large-scale moiphosyntactic dictionary plus POS tagger

References

Ide, N., Romary L., 2001 XML Support for Annotated

Language Resources In: Linguistic Exploration,

Workshop on Web-Based Language Documentation and Description, Dec 12 - Dec 15, 2000, University

of Pennsylvania Philadelphia, Pennsylvania, USA Ide, N & Brew, C., 2000 Requirements, Tools, and

Architectures for Annotated Corpora In:

Proceed-ings of the EAGLES/ISLE Workshop on Meta-Descriptions and Annotation Schemas for Multimo-dal/ MultimediaLanguage Resources and Data Ar-chitectures and Software Support for Large Corpora.

Paris: European Language Resources Association Oracle9i Database Documentation (Release 9.0.2), 2002.

Rational Rose Enterprise Edition 2001, Documentation Fellbaum C (ed.) WordNet An Electronic Lexical Da-tabase Bradford Books.

Yablonsky S.A., 1998 Russicon Slavonic Language Resources and Software In: A Rubio, N Gallardo, Yablonsky S.A., 2000 Russian Monitor Corpora: Com-position, Linguistic Encoding and Internet Publica-tion Proceedings Second International Conference

on Language Resources & Evaluation, Athens, Greece, 2000.

Yablonsky S.A., 2002 Corpora as Object-Oriented Sys-tem From UML-notation to Implementation Pro-ceedings Third International Conference on Language Resources & Evaluation, Las Palmas, Ca-nary Islands-Spain, 2002.

Định dạng
Số trang	4
Dung lượng	598,8 KB