The Corpora Management System Based on Java and Oracle Technologies
Serge Yablonsky
Petersburg Transport University, Computer Department, Moscow av., 9,
St.-Petersburg, 190031, Russia
Russicon Company, Kazanskaya str., 56, ap.2, St.-Petersburg, 190000, Russia
serge_yablonsky@hotmail.com; root@russicon.spb.su
http://www.russicon.ru
Abstract
The paper discusses the design of a corpora
management system (CMS) that uses Java
and the Oracle9i DBMS to support strategic
corpora analysis. We present a pilot
web-based CMS that supports linguists in their
daily work. The system offers facilities to
assist linguists and Internet users as they
search for relevant material and then
classify and annotate it.
1 Introduction
A wide class of document management
solutions and products falls under the rubric of
"corpora and text mining". They resemble data
mining solutions in that they deal with large
volumes of data. The difference is that data mining
extracts, analyzes, and summarizes numerical,
structured data, whereas text mining handles large
volumes of unstructured, text-based data. Document
systems with large-scale linguistic annotation are
used by a wide range of research and commercial
applications.
This paper presents a web-based text corpora
management system (CMS) and focuses on the
development of UML specifications, the architecture, and
actual implementations of DBMS tools to support
strategic corpora analysis.
We present the basic features of a prototype
corpora management system, currently under development,
that is intended to support linguists in their daily work. The
system offers facilities to assist linguists and Internet users
as they search for relevant material and then classify and
annotate this material in a repository.
The CMS is implemented using Java and the commercial DBMS Oracle9i.
2 System Overview
The corpora management system combines Java, XML, XSL, HTML, and Oracle9i components (Yablonsky S.A., 2002). The system was built by adapting existing and new DBMS and Java tools to the needs of the intended task for the Russian language (Yablonsky S.A., 2000).
The CMS consists of the following main parts (see Figure 1):
• Corpora: files in more than 150 different formats (doc, rtf, pdf, htm, xml, etc.);
• Annotated corpora: files in XML format using the XML Corpus Encoding Standard (XCES, http://www.xml-ces.org; Ide, N. & Brew, C., 2000; Ide, N., Romary L., 2001) and in text formats;
• Oracle 9i DBMS Enterprise Server (Release 2) with the following main components:
  o Grammatical dictionaries (inflection paradigms of the given language) for the languages that are not supported by Oracle Text;
  o Ontologies / WordNet (Fellbaum C.) / domain thesauri for the given languages;
  o Word Index (index of all entry words or lemmas of the Corpora);
  o Oracle Text;
• Web server (for example, Apache 2.x plus Tomcat 4.1.x) with the following Java Server Pages (JSP) components:
  o Search Engine;
  o Concept Clustering;
  o Document Management System;
  o Interface Subsystem.
Figure 1. CMS structure. Components shown: client (PC, Internet/intranet); web server (Apache 2.x + Tomcat 4.1.x) with Search Engine, Concept Clustering (Ontology, Thesaurus), Document Management System, and Graphical Display and Navigation; DBMS Oracle 9iR2 with the ontology database and grammatical dictionary; corpora (XML, HTML, Doc, PDF, etc.); annotated corpora (XML).
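
The Word Index component listed above can be pictured as an inverted index from words to the documents that contain them. The following is only a minimal sketch under that assumption; the function name `build_word_index` and the two-document sample corpus are hypothetical, and the actual CMS keeps its index in Oracle rather than in memory.

```python
import re
from collections import defaultdict


def build_word_index(docs):
    """Map each lowercased word to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # \w+ matches Unicode letters, so Cyrillic text is tokenized as well.
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical two-document corpus, used only for illustration.
corpus = {
    1: "Corpora support linguists",
    2: "Linguists annotate corpora",
}
word_index = build_word_index(corpus)
print(word_index["corpora"])   # [1, 2]
print(word_index["annotate"])  # [2]
```

An index over lemmas rather than surface forms would follow the same shape, with each word normalized to its lemma before insertion.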
We present a fragment of the CMS
UML specification built with Rational Rose (Rational Rose
Enterprise Edition Documentation, 2001), which
could be expanded in the future by the community of
language and speech resources developers (see Figure
2).
Here, for example, the relational table DOCS contains the following attributes:
• DOCS_NAME: document name;
• DOCS_AUTHOR: document author;
• DOCS_HTTP: document path;
• DOCS_LANG: document language;
• DOCS_LANG_CODE: document encoding;
• DOCS_DATE: date of inclusion in the Corpora;
• DOCS_EXTENSION: document file extension.
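
To make the DOCS attributes above concrete, here is a small sketch that creates an equivalent table and queries it. It rests on several assumptions: SQLite (from the Python standard library) stands in for Oracle9i, the Oracle types NUMBER, VARCHAR2, and DATE are mapped to INTEGER and TEXT, DOCS_NO is treated as the primary key, and the sample row is invented for illustration.

```python
import sqlite3

# SQLite stands in for Oracle9i; Oracle would use NUMBER, VARCHAR2 and DATE.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE DOCS (
        DOCS_NO        INTEGER PRIMARY KEY,  -- document number (assumed key)
        DOCS_NAME      TEXT NOT NULL,        -- document name
        DOCS_AUTHOR    TEXT,                 -- document author
        DOCS_HTTP      TEXT,                 -- document path
        DOCS_LANG      TEXT,                 -- document language
        DOCS_LANG_CODE TEXT,                 -- document encoding
        DOCS_DATE      TEXT,                 -- date of inclusion (ISO 8601 string)
        DOCS_EXTENSION TEXT                  -- document file extension
    )
    """
)

# Hypothetical sample row.
conn.execute(
    "INSERT INTO DOCS (DOCS_NAME, DOCS_AUTHOR, DOCS_HTTP, DOCS_LANG,"
    " DOCS_LANG_CODE, DOCS_DATE, DOCS_EXTENSION) VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("sample", "Unknown", "/corpora/sample.xml", "ru", "UTF-8", "2003-01-15", "xml"),
)

# Select Russian XML documents, using bind parameters for the filter values.
rows = conn.execute(
    "SELECT DOCS_NO, DOCS_NAME FROM DOCS WHERE DOCS_LANG = ? AND DOCS_EXTENSION = ?",
    ("ru", "xml"),
).fetchall()
print(rows)  # [(1, 'sample')]
```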
The set of different types of UML specifications leads to a full three-tier system comprising user, business, and data services.
Figure 2. Fragment of UML notation of the data model. The diagram shows the relational tables DOCS, RUBRICS, QUERIES, RESULTS, WORD_INDEX, and RES_ADV_SEAR, related through the keys DOCS_NO, RUBRICS_NO, and QRY_NO.
The UML specification of business services uses
standard UML notations for standard linguistic
annotation and corpora manipulation procedures.
Reusability of linguistic and corpora manipulation
business services could be achieved by using a
widely accepted set of UML notation standards for
corpus-based work in natural language processing
applications.
3 System Features
The search engine of the prototype system
takes particular advantage of Oracle Text search
and services and includes (Oracle9i Database
Documentation):
• Content-based retrieval on free text with both
literal (word) predicates and thematic
predicates. It includes a comprehensive range of
operators and index preferences (e.g. Boolean, exact phrase match, proximity, section searching, fuzzy, stemming, wildcard, thesaurus, stopwords, case sensitivity, and search scoring), "about" search, structured search, broad document format support, and multi-language support. For example, texts can be searched for stems of words: a search for "teach" would also return
"teaching", "taught", etc. Fuzzy match can be used if you are not sure of the spelling of a word. A proximity search finds words that are close to each other within a document. Documents can be searched on what a document is about, as opposed to the existence of specific words. Gists or theme summaries can
be produced that summarize what a document is about (using themes).
• For the XML framework, XPath searching enables
sophisticated queries that can reference and
leverage the embedded structure of XML
documents; instead of using a text query to
find documents, you use a document to find
queries. XML path searching is able to
perform sophisticated section searches: doctype
disambiguation, attribute value searching,
automatic section indexing, and more.
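
The query facilities listed above correspond to operators in the Oracle Text query language: `$` for stemming, `?` for fuzzy matching, `NEAR` for proximity, `ABOUT` for theme search, and `INPATH` for XML path search. The sketch below only builds query strings for a `CONTAINS` predicate and does not connect to a database. The helper names and the column name `DOCS_TEXT` are hypothetical, and the operators should be checked against the Oracle Text reference for the release in use.

```python
def stem(word):
    """Stem operator: match words that share the same root, e.g. teach/taught."""
    return f"${word}"


def fuzzy(word):
    """Fuzzy operator: match words with a similar spelling."""
    return f"?{word}"


def near(words, max_span):
    """Proximity operator: all words must occur within max_span words."""
    return f"NEAR(({', '.join(words)}), {max_span})"


def about(phrase):
    """Theme query: match documents about the phrase, not just containing it."""
    return f"ABOUT({phrase})"


def in_path(term, path):
    """XML path query: match the term only inside the given element path."""
    return f"{term} INPATH ({path})"


def contains_sql(table, column, query):
    """Return a SELECT statement plus bind values; the query text is bound as :q."""
    sql = f"SELECT * FROM {table} WHERE CONTAINS({column}, :q) > 0"
    return sql, {"q": query}


# DOCS_TEXT is a hypothetical indexed column, used only for illustration.
sql, binds = contains_sql("DOCS", "DOCS_TEXT", stem("teach"))
print(sql)    # SELECT * FROM DOCS WHERE CONTAINS(DOCS_TEXT, :q) > 0
print(binds)  # {'q': '$teach'}
print(near(["corpus", "annotation"], 5))  # NEAR((corpus, annotation), 5)
```

Passing the query text as a bind variable rather than concatenating it into the SQL keeps user input out of the statement itself.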
In addition to the search capabilities, a number of
other features are provided to simplify application
development:
• Corpora Format Support. To index
documents stored in a variety of native
formats, such as Word, Excel, PowerPoint,
WordPerfect, HTML, and Acrobat/PDF, the
system supplies a broad variety of "filters" that
allow documents to be indexed in their native
formats. More than 150 file formats are
supported. Flexible storage location: documents
can be stored and indexed in the database, in a
location pointed to by a URL, or in an external file.
• Corpora Graphical Display and Navigation.
System services can convert any supported
document format to either plain text or
formatted text (an HTML approximation retaining as
much as possible of the original formatting;
available for all formats except PDF). Both
plain text and HTML versions may be viewed
in a standard browser.
• Document Management System. The system
supplies an administration tool through which all
major text maintenance and administration
functions may be performed.
• Concept Clustering identifies the relationships
between phrases of the texts and builds a
"lexical network," grouping related phrases
and enhancing the most important features of
these groupings. The resulting patterns reveal
the conceptual backbone of the text collection.
Russian WordNet is used for basic conceptual
grouping.
• Automatic Corpora Text collection,
tokenization, part-of-speech tagging. Reusability of
linguistic resources is achieved by annotating
texts using a common data model. For that
purpose, XML and related standards such as the
XML Corpus Encoding Standard (XCES, http://www.xml-ces.org) (Ide, et al., 2000) are used in the system.
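
The tokenization, tagging, and XML annotation steps described in the last item can be illustrated with a toy pipeline. This is a sketch under loud assumptions: the element names `s` and `tok` and the attribute `pos` are placeholders and not the actual XCES schema, and the three-word lexicon `TOY_POS` is invented; a real tagger would rely on the grammatical dictionary.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical toy lexicon; not part of the CMS.
TOY_POS = {"linguists": "NOUN", "annotate": "VERB", "texts": "NOUN"}


def annotate(text):
    """Tokenize text and wrap each token in a <tok> element carrying a POS tag."""
    sentence = ET.Element("s")
    # Words become one token each; every punctuation mark is its own token.
    for token in re.findall(r"\w+|[^\w\s]", text):
        tok = ET.SubElement(sentence, "tok")
        tok.set("pos", TOY_POS.get(token.lower(), "UNK"))
        tok.text = token
    return sentence


sentence = annotate("Linguists annotate texts.")
print(ET.tostring(sentence, encoding="unicode"))

# Structure-aware lookup using ElementTree's limited XPath support.
nouns = [tok.text for tok in sentence.findall("tok[@pos='NOUN']")]
print(nouns)  # ['Linguists', 'texts']
```

Because the annotation is plain XML, the same document can be queried by structure (for example, all tokens with a given tag) as well as by text.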
The CMS is localized for the Russian language. Oracle 9i Text does not support Russian. To use Oracle Text capabilities, we add a Russian Grammatical Dictionary (a morphosyntactic dictionary) that consists of word paradigms with grammatical characteristics for every entry of the Corpora word index (Yablonsky S.A., 1998). It enables linguistic (paradigm) search in the Corpora and is also used for Corpora text annotation.
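
Paradigm search, as described above, can be thought of as expanding a query word to every form in its inflection paradigm before matching. The sketch below assumes a miniature in-memory dictionary; the names `PARADIGMS`, `expand`, and `paradigm_search` and the sample documents are hypothetical and do not reflect the format of the Russicon dictionary.

```python
import re

# Hypothetical miniature dictionary: lemma -> all forms of its paradigm.
PARADIGMS = {
    "teach": ["teach", "teaches", "teaching", "taught"],
    "книга": ["книга", "книги", "книге", "книгу", "книгой",
              "книг", "книгам", "книгами", "книгах"],
}

# Reverse lookup: any inflected form -> its lemma.
FORM_TO_LEMMA = {form: lemma for lemma, forms in PARADIGMS.items() for form in forms}


def expand(word):
    """Return every form in the word's paradigm, or the word itself if unknown."""
    lemma = FORM_TO_LEMMA.get(word.lower())
    return PARADIGMS[lemma] if lemma else [word.lower()]


def paradigm_search(word, docs):
    """Return ids of documents containing any form of the word's paradigm."""
    forms = set(expand(word))
    return [doc_id for doc_id, text in docs.items()
            if forms & set(re.findall(r"\w+", text.lower()))]


# Hypothetical documents, used only for illustration.
docs = {
    1: "She taught linguistics",
    2: "Новые книги по лингвистике",
    3: "No match here",
}
print(paradigm_search("teach", docs))  # [1]
print(paradigm_search("книга", docs))  # [2]
```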
The system could be easily adapted to different software platforms (through Java and Oracle) and to the needs of other languages (through Unicode), making the whole system portable with minimal changes.
The only language-specific resources are a large-scale morphosyntactic dictionary plus a POS tagger.
References
Ide, N., Romary L., 2001. XML Support for Annotated Language Resources. In: Linguistic Exploration, Workshop on Web-Based Language Documentation and Description, Dec 12 - Dec 15, 2000, University of Pennsylvania, Philadelphia, Pennsylvania, USA.
Ide, N. & Brew, C., 2000. Requirements, Tools, and Architectures for Annotated Corpora. In: Proceedings of the EAGLES/ISLE Workshop on Meta-Descriptions and Annotation Schemas for Multimodal/Multimedia Language Resources and Data Architectures and Software Support for Large Corpora. Paris: European Language Resources Association.
Oracle9i Database Documentation (Release 9.0.2), 2002.
Rational Rose Enterprise Edition 2001, Documentation.
Fellbaum C. (ed.) WordNet: An Electronic Lexical Database. Bradford Books.
Yablonsky S.A., 1998. Russicon Slavonic Language Resources and Software. In: A Rubio, N Gallardo,
Yablonsky S.A., 2000. Russian Monitor Corpora: Composition, Linguistic Encoding and Internet Publication. Proceedings Second International Conference on Language Resources & Evaluation, Athens, Greece, 2000.
Yablonsky S.A., 2002. Corpora as Object-Oriented System. From UML-notation to Implementation. Proceedings Third International Conference on Language Resources & Evaluation, Las Palmas, Canary Islands-Spain, 2002.