
DETECTING PATTERNS IN A LEXICAL DATA BASE

Nicoletta Calzolari

Dipartimento di Linguistica - Universita' di Pisa

Istituto di Linguistica Computazionale del CNR

Via della Faggiola 32

50100 Pisa - Italy

ABSTRACT

In a well-structured Lexical Data Base, a number of relations among lexical entries can be interactively evidenced. The present article examines hyponymy, as an example of paradigmatic relation, and "restriction" relation, as a syntagmatic relation. The theoretical results of their implementation are illustrated.

I INTRODUCTION

In previous papers it has been pointed out that in a well-structured Lexical Data Base it becomes possible to detect automatically, and to evidence through interactive queries, a number of morphological, syntactic, or semantic relationships between lexical entries, such as synonymy, hyponymy, hyperonymy, derivation, case-argument, lexical field, etc.

The present article examines hyponymy, as an example of paradigmatic relation, and what can be called "restriction or modification" relation, as a syntagmatic relation. By restriction or modification relation, I mean that part of a so-called "aristotelian" definition which has the function of linking the "genus" and the "differentia specifica".

When evidenced in a lexicon, the hyponymy relation produces hierarchical trees partitioning the lexicon in many semantically coherent subsets. These trees are not created once and for all, but it is important that they are procedurally activated at the query moment.

While evidencing the second relation considered, one can investigate whether it is possible to discover any correlation between lexical or grammatical features in definitions and particular kinds of "definienda", and thus try to answer questions such as the following: "Are there any connections between these restriction relations and the fundamental ways of definition, i.e. the criterial parameters by which people define things?"

For both relations, the paper presents the different procedures by which they are automatically recognized and extracted from the natural language definitions, the degree of reliability of their automatic labeling, the use of these labels in interactive queries on the lexical data base, and finally the theoretical results of their implementation in a Machine-Dictionary.

II THE LANGUAGE OF DEFINITIONS AS A SUBLANGUAGE

I am trying to develop and exploit the idea of considering the language of dictionary definitions as a particular sublanguage within natural language. This perspective cannot obviously be adopted for subject matter restrictions in definitions, but only for the purpose of the text, i.e. the specific communicative goal. From this restriction on the purpose of the text, certain lexico-grammatical restrictions do result, which prove to be very useful.

As to the restrictions on the lexical richness of definitions, these are not due to the fact that they relate to a specific domain of discourse, but only to the property of closure (although not satisfied at 100%): the defining vocabulary should in principle be simpler and more restricted than the defined set of lemmas, i.e. the former should be a proper subset of the latter.

This kind of quantitative restriction on the vocabulary of definitions would not be of any interest in itself, if it were not accompanied by other kinds of constraints both on a) the lexical, and on b) the grammatical side.

a) From the frequency list of the words used in definitions (about 800,000 word-occurrences, and 75,000 word-types), it appears in fact that some words have a much greater importance than in normal language, as evidenced by a comparison with the data of the Lessico di Frequenza della Lingua Italiana Contemporanea (Bortolini et al., 1971). These are the defining generic terms such as ACT, EFFECT, PERSON, OBJECT, WHO, PROCESS, CAUSE, etc. It is not by chance that these same concepts are of relevance in many Artificial Intelligence systems.

b) Not only single words, or classes of words, are particularly relevant in the defining sublanguage. There are also lexical patterns and syntactic patterns which occur with great frequency, and which play a very special role in defining sentences.

The combination of these constraints can be and actually is very useful, when trying to exploit the information contained in definitions, and when transforming an archive of natural language definitions into a knowledge base structured as a network. Some important parts of knowledge are in fact already retrievable in interactive mode from the Italian Lexical Data Base, which has recently been restructured.

Analyses on large corpora of definitions, carried out on many dictionaries (Amsler, 1980; Calzolari, 1983a, 1983b; Michiels, Noel, 1982) have in fact shown that the definitions sublanguage displays several regularities of lexical and syntactic occurrences and patterns. These general lexical classes and the classes of recurrent patterns can be more or less easily captured, for instance by pattern-matching rules, and if possible characterized with formal rules.

III HYPONYMY RELATION

Hyponymy is the most important relation to be evidenced in a lexicon. Due to its taxonomic nature, it gives the lexicon, when implemented, a particular hierarchical structure: its result is obviously not a tree, but many tangled hierarchies (Amsler, 1980).

Instead of evidencing and labelling this relation by hand, I have tried to characterize it procedurally. The procedure which automatically coded (with a precision of more than 90% calculated on a random sample of 2000 definitions) true superordinates in all the definitions (approx. 185,000 for 103,000 lemmas) was based almost exclusively on the position of the "genus" term at the beginning of the definitional phrases, giving Nouns, Verbs and Adjectives as superordinates of defined entries of the same lexical category. Ad hoc subroutines solved exceptional cases where a) quantifiers, or other modifiers preceded the genus term (e.g. aletta -> piccolo gruppo di penne dietro l'angolo dell'ala), or b) more than one genus was present in the definition (e.g. assordare -> [...] prepositional phrase, usually of locative type, was at the beginning of the phrase (e.g. piazzato -> nel rugby, calcio al pallone collocato sul terreno).
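The position-based heuristic described above can be illustrated with a minimal Python sketch. It is not the procedure actually used on the Italian data base: the function name `extract_genus`, the two small word lists, and the rule "skip a leading phrase closed by a comma when it starts with a locative preposition" are all assumptions made for illustration, and the case of several genus terms is not handled.

```python
import re
from typing import Optional

# Hypothetical word lists: the paper does not enumerate the modifiers
# or prepositions that its ad hoc subroutines handled.
LEADING_MODIFIERS = {"piccolo", "piccola", "grande", "ogni"}
LOCATIVE_PREPOSITIONS = {"in", "nel", "nello", "nella", "nei", "negli", "nelle"}


def extract_genus(definition: str) -> Optional[str]:
    """Return the first content word of a definition as its candidate genus."""
    text = definition.strip().lower()

    # Drop a leading prepositional phrase such as "nel rugby," when present.
    first, _, rest = text.partition(",")
    words = first.split()
    if rest and words and words[0] in LOCATIVE_PREPOSITIONS:
        text = rest.strip()

    # Skip quantifiers or other modifiers that precede the genus term.
    for token in re.findall(r"[a-zàèéìòù']+", text):
        if token not in LEADING_MODIFIERS:
            return token
    return None


print(extract_genus("piccolo gruppo di penne dietro l'angolo dell'ala"))
print(extract_genus("nel rugby, calcio al pallone collocato sul terreno"))
```

On the two exceptional definitions quoted in the text, the sketch returns `gruppo` and `calcio`; a plain definition such as "calcopirite contenente stagno" yields its first word.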

Even though the first immediate purpose of this procedure is of classificational nature, the ultimate goal is the extraction and formalization of the most relevant relationship between lexical items which is implicitly stored in any standard printed dictionary. It is in fact now possible to retrieve in the lexical data base not only all the definitions in which any possible word-form appears, together with the defined lemmas (e.g. SUONO appears in 328 definitions), but also to retrieve on-line, if desired, only the definitions in which the given word-form is used as a superordinate, therefore with the list of its hyponyms (e.g. the same word SUONO is used as superordinate of only 65 words, i.e. of a subset of the preceding set containing MUSICA, RUMORE, SQUILLO, SUSSURRO, etc.).
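The difference between the two kinds of retrieval can be sketched with two inverted indexes, one over every word-form in a definition and one over the coded superordinate only. This is an illustrative Python sketch, not the query language of the data base: the class `DefinitionIndex` and its methods are hypothetical names, and the three toy definitions are invented for the example rather than quoted from the dictionary.

```python
from collections import defaultdict


class DefinitionIndex:
    """Toy index distinguishing 'word occurs in definition' from 'word is the genus'."""

    def __init__(self):
        self.occurs_in = defaultdict(set)  # word-form -> lemmas whose definition contains it
        self.hyponyms = defaultdict(set)   # superordinate -> lemmas it defines

    def add(self, lemma, definition, genus):
        for word in definition.lower().split():
            self.occurs_in[word].add(lemma)
        self.hyponyms[genus.lower()].add(lemma)

    def definitions_containing(self, word):
        return sorted(self.occurs_in[word.lower()])

    def hyponyms_of(self, word):
        return sorted(self.hyponyms[word.lower()])


idx = DefinitionIndex()
# Invented toy definitions (assumption), each with a hand-supplied genus.
idx.add("squillo", "suono acuto e breve", "suono")
idx.add("rumore", "suono confuso", "suono")
idx.add("risonanza", "prolungamento di un suono", "prolungamento")

print(idx.definitions_containing("SUONO"))
print(idx.hyponyms_of("SUONO"))
```

The second query returns a subset of the first, which mirrors the 65 hyponyms found among the 328 definitions containing SUONO.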

The query-language so far implemented for the lexical data base therefore makes it possible to retrieve information on this hierarchical relation [...] interconnections within the entire lexicon. The links produced can be analyzed, evaluated, and, if necessary, interactively corrected.

From explorations on the trees thus obtained we can also try to set up classes and subclasses of superordinates, on the basis of the upper nodes to which many other nodes are connected as descendants. Only as an example, the identification criterion for the noun-class "SET-OF" containing INSIEME, GRUPPO, COLLEZIONE, COMPLESSO, AGGREGATO, etc., among the set of noun-superordinates, is the fact that they are linked one to the other in the tree which results from querying the data base. Their hyponyms will obviously be for the most part collective nouns.

The identification of word-classes like this one leads to the next step in the formalization of the hyponymy relation, which will consist in the insertion of a label indicating a semantic class to these sets of superordinates. It will thus be possible to retrieve, for example, all the nouns generically definable as "SET-OF", independently of the particular word denoting a set used in definitions. Since it is already possible to trace these chains of hyponyms going upwards or downwards for more than one level, one can immediately ask whether, for example, MASSERIA belongs to the set of collectives even if it is defined as MANDRIA, because MANDRIA is defined as BRANCO, which is in turn defined as INSIEME, which finally is one of the nouns belonging to the class "SET-OF".
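The MASSERIA example amounts to following superordinate links upwards until a member of the labelled class is reached. A minimal Python sketch of that traversal follows; the names `SUPERORDINATES`, `SET_OF`, `ancestors` and `belongs_to_class` are hypothetical, and the link table contains only the chain mentioned in the text plus one unrelated entry. Each lemma may have several superordinates, since the hierarchies are tangled.

```python
# Superordinate links taken from the chain described in the text; a lemma maps
# to a list because the hierarchies are tangled rather than strict trees.
SUPERORDINATES = {
    "masseria": ["mandria"],
    "mandria": ["branco"],
    "branco": ["insieme"],
    "stannite": ["calcopirite"],
}

# Members of the "SET-OF" noun class listed in the text.
SET_OF = {"insieme", "gruppo", "collezione", "complesso", "aggregato"}


def ancestors(lemma, links):
    """Collect every node reachable by going upwards from lemma."""
    seen = set()
    stack = [lemma]
    while stack:
        node = stack.pop()
        for parent in links.get(node, ()):
            if parent not in seen:  # also guards against circular definitions
                seen.add(parent)
                stack.append(parent)
    return seen


def belongs_to_class(lemma, links, class_members):
    """True if lemma itself or any of its ancestors is a member of the class."""
    return lemma in class_members or bool(ancestors(lemma, links) & class_members)


print(belongs_to_class("masseria", SUPERORDINATES, SET_OF))
print(belongs_to_class("stannite", SUPERORDINATES, SET_OF))
```

Because the traversal is run at query time over the stored links, nothing has to be precomputed when a link is corrected interactively.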


Even though some refinements are still required in order to improve the reliability of the automatic recovery of ISA-related terms chains, this kind of structural relation within the lexicon, that is hyponymy, is at a good stage of implementation in the Italian lexical data base.

Much still remains to be done as far as other very interesting relationships between the entries are concerned. I am now considering what could be called "restriction or modification" relation, since its purpose is to restrict or modify the meaning of the genus term. It is exemplified in the following definitions by the words in italics:

stannite -> calcopirite contenente stagno

arricciolare -> modellare a forma di ricciolo

risonatore -> dispositivo atto a generare risonanza

I wish to evaluate what could be done with respect to this kind of relation, starting from the available definitional data. One of the first aims of this lexicological research is to analyze, by means of computational tools, and to use the information contained in the different definitional formats and structures. The implementation of a number of procedures which convert the natural language information conveyed by definitions into processable formats, made up by structured relational links between lexical items or classes of lexical items, is now taken into consideration.

These formats can be made traceable e.g. in an Information Retrieval system on definitions, like the one actually implemented, on the entire corpus, for the taxonomic part of the lexical structure. But these formatted relational structures can also be used as starting points for a computationally exploitable reorganization of the definitional content. One of the characteristics of the definitional sublanguage, i.e. the presence of recurrent patterns (such as proprio di, relativo a, prodotto da, originario di, etc.), makes it possible, at least in certain cases, to produce a constant mapping from certain variable types of more frequently detected definitional phrases to constant underlying relational structures.

Using rather simple pattern-matching procedures some classes and subclasses of definitions can be separated, and a small number of simpler types of definitions have already been converted into a formalized coded format also with regard to this restriction relation. A new [...] data base. The distinguished elements of a number of simple natural language patterns are mapped into some general structured information formats. Up to now, some of the definitions displaying the following restriction relations have been treated:

RELFORM (e.g. a forma di)

RELPROV (e.g. provvisto di)

RELAPT (e.g. atto a)

and the corresponding relational links generated.

Among the lexical variants of RELPROV there [...] ricco di, etc.; while RELFORM groups the following variants of a different type: in forma di, che ha (la) forma (di), di forma, di forma simile a (quella di), sotto forma di, avente forma di, etc. It is thus possible, for example, to retrieve, among the 1271 definitions in which the word FORMA appears, only those defining something as "having the shape of something else". The implementation of these links makes it possible to produce another kind of partitioning within the lexical system, and to better investigate the internal structure of words.
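The mapping from surface variants to a constant relation label can be sketched with a few regular expressions. The Python sketch below is only an illustration under stated assumptions: the variant lists are limited to those named in the text, while the function name `find_restriction` and the (genus part, label, argument) output triple are hypothetical choices, not the coded format of the data base.

```python
import re

# Surface variants named in the text, written as regular expressions.
# More specific variants come first so that "di forma simile a" is tried
# before the shorter "di forma".
RELATION_PATTERNS = {
    "RELFORM": [
        r"che ha (?:la )?forma(?: di)?",
        r"di forma simile a(?: quella di)?",
        r"sotto forma di",
        r"avente forma di",
        r"in forma di",
        r"a forma di",
        r"di forma",
    ],
    "RELPROV": [r"provvisto di", r"ricco di"],
    "RELAPT": [r"atto a"],
}


def find_restriction(definition):
    """Return (genus part, relation label, argument), or None if no pattern matches."""
    text = definition.lower().strip()
    for label, patterns in RELATION_PATTERNS.items():
        for pattern in patterns:
            match = re.search(r"\b" + pattern + r"\b", text)
            if match:
                genus_part = text[:match.start()].strip()
                argument = text[match.end():].strip()
                return (genus_part, label, argument)
    return None


print(find_restriction("modellare a forma di ricciolo"))
print(find_restriction("dispositivo atto a generare risonanza"))
```

Retrieving every definition whose label is RELFORM then gives the "having the shape of something else" subset, regardless of which surface variant was used.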

A procedure of the kind exemplified above, based on pattern-matching, is possible for a good number of definition types; for example, with a different format, for many adjectives:

Adj >> RELX

: VP :

where several groups of definitions are found to share a common underlying structure in terms of the restriction relation involved, in spite of other lexical and syntactic differences.

V FUTURE PERSPECTIVES

A comparison with the definitional corpora of other dictionaries, also of other languages, will certainly prove to be useful in establishing the set of the most general or primitive Relations, used for definition in lexicographical practice, often overlapping with the primitive Relations stated in many AI systems. These relations, mapped into a formal link in the data base, can then be paraphrased in each language, in the standard language.

The data base structure envisaged makes it possible both to maintain at a lower level (the starting level), and to eliminate at an upper level, many peculiarities and variations in the linguistic relations; their effect is to facilitate the comprehension by the users of the printed dictionary, inhibiting however immediate comprehension by procedural routines in the mechanical processing of dictionary data.

By applying similar methods of automatic conversion and mapping into suitable formats, as extensively as possible throughout the lexicon, many definitional expressions can be submitted to an attempt at standardization, thus achieving greater precision, which gives a considerable improvement when performing, for example, information retrieval operations on the content of a dictionary.

This more structured, but, in another sense, simplified version of definitions, which also accounts for their relational nature, provides an excellent basis for testing and studying the "knowledge of the world" which underlies the structure of a dictionary.

VI REFERENCES

Alinei, M., La Struttura del Lessico, Bologna: Il Mulino, 1974.

Amsler, R.A., The Structure of the Merriam-Webster Pocket Dictionary, Ph.D. Thesis, Department of Computer Science, University of Texas, Austin, Texas, 1980.

Bortolini, U., Tagliavini, C., Zampolli, A., Lessico di Frequenza della Lingua Italiana Contemporanea, Milano: Garzanti, 1972.

Calzolari, N., "Towards the organization of lexical definitions on a data base structure", COLING82 Abstracts, ed. by E. Hajičová, Prague: Charles University, 1982, 61-64.

Calzolari, N., "Lexical definitions in a computerized dictionary", Computers and Artificial Intelligence, II(1983a)3, 225-233.

Calzolari, N., "Semantic links and the dictionary", in Proceedings of the 6th International Conference on Computers and the Humanities, ed. by S.K. Burton, D.D. Short, Rockville (Maryland): Computer Science Press, 1983b, 47-50.

Calzolari, N., Ceccotti, M.L., "Organizing a large scale lexical database dictionary", Actes du Congrès Informatique et Sciences Humaines, Liège: L.A.S.L.A., 1981, 155-163.

[...] verbs", Language, 55(1979)4, 767-811.

Evens, M.W., Litowitz, B.E., Markowitz, J.A., Smith, R.N., Werner, O., Lexical-Semantic Relations: a Comparative Survey, Edmonton, Alberta: Linguistic Research Inc., 1980.

Findler, N.V. (ed.), Associative Networks, New York: Academic Press, 1979.

Hendrix, G.G., "Natural-language interface", Proceedings of the Workshop 'Applied Computational Linguistics in Perspective', American Journal of Computational Linguistics, 8(1982), 56-61.

Michiels, A., Mullenders, J., Noël, J., "Exploiting a large data base by Longman", COLING80: Proceedings of the 8th International Conference on Computational Linguistics, Tokyo, 1980, 374-382.

Michiels, A., Noël, J., "Approaches to thesaurus production", COLING82: Proceedings of the Ninth International Conference on Computational Linguistics, ed. by J. Horecký, Amsterdam: North-Holland, 1982, 227-232.

Nagao, M., Tsujii, J., Ueda, Y., Takiyama, M., "An attempt to computerize dictionary data bases", COLING80: Proceedings of the 8th International Conference on Computational Linguistics, Tokyo, 1980, 534-542.

Quillian, M.R., "Semantic memory", in Semantic Information Processing, ed. by M. Minsky, Cambridge (Mass.): MIT Press, 1968, [...].

Smith, R.N., "On defining adjectives: part III", Dictionaries, the Journal of the Dictionary Society of North America, Winter, (1981)5, 28-38.

Smith, R.N., Maxwell, E., "An English dictionary for computerized syntactic and semantic processing", in Computational and Mathematical Linguistics, ed. by A. Zampolli, N. Calzolari, Firenze: Olschki, 1977, 303-322.

Walker, D.E., Amsler, R.A., Proposal to the National Science Foundation on an Invitational Workshop on Machine-Readable Dictionaries, SRI, 1982 (mimeo).

Zingarelli, N., Vocabolario della lingua italiana, Bologna: Zanichelli, 1971.
