DETECTING PATTERNS IN A LEXICAL DATA BASE
Nicoletta Calzolari
Dipartimento di Linguistica - Universita' di Pisa
Istituto di Linguistica Computazionale del CNR
Via della Faggiola 32
50100 Pisa - Italy
ABSTRACT
In a well-structured Lexical Data Base, a
number of relations among lexical entries can be
interactively evidenced. The present article
examines hyponymy, as an example of paradigmatic
relation, and "restriction" relation, as a
syntagmatic relation. The theoretical results of
their implementation are illustrated.
I INTRODUCTION
In previous papers it has been pointed out
that in a well-structured Lexical Data Base it
becomes possible to detect automatically, and to
evidence through interactive queries, a number of
morphological, syntactic, or semantic
relationships between lexical entries, such as
synonymy, hyponymy, hyperonymy, derivation,
case-argument, lexical field, etc.
The present article examines hyponymy, as an
example of paradigmatic relation, and what can be
called "restriction or modification" relation, as
a syntagmatic relation. By restriction or
modification relation, I mean that part of a
so-called "aristotelian" definition which has the
function of linking the "genus" and the
"differentia specifica".
When evidenced in a lexicon, the hyponymy
relation produces hierarchical trees partitioning
the lexicon into many semantically coherent
subsets. These trees are not created once and
for all; what matters is that they are
procedurally activated at the query moment.
While evidencing the second relation
considered, one can investigate whether it
is possible to discover any correlation between
lexical or grammatical features in definitions
and particular kinds of "definienda", and thus
try to answer questions such as the following:
"Are there any connections between these
restriction relations and the fundamental ways of
definition, i.e. the criterial parameters by
which people define things?"
For both relations, the paper presents the
different procedures by which they are
automatically recognized and extracted from the
natural language definitions, the degree of
reliability of their automatic labeling, the use
of these labels in interactive queries on the
lexical data base, and finally the theoretical
results of their implementation in a
Machine-Dictionary.
II THE LANGUAGE OF DEFINITIONS AS A SUBLANGUAGE
I am trying to develop and exploit the idea of
considering the language of dictionary
definitions as a particular sublanguage within
natural language. This perspective obviously
cannot be adopted with respect to subject matter
restrictions in definitions, but only with respect
to the purpose of the text, i.e. the specific
communicative goal. From this restriction on the
purpose of the text, certain lexico-grammatical
restrictions do result, which prove to be very
useful.
As to the restrictions on the lexical richness
of definitions, these are not due to the fact
that they relate to a specific domain of
discourse, but only to the property of closure
(although not satisfied at 100%) that the
defining vocabulary should in principle be
simpler and more restricted than the defined set
of lemmas, i.e. the former should be a proper
subset of the latter.
This kind of quantitative restriction on the
vocabulary of definitions would not be of any
interest in itself, if it were not accompanied by
other kinds of constraints both on a) the
lexical, and on b) the grammatical side.
a) From the frequency list of the words used
in definitions (about 800,000 word-occurrences,
and 75,000 word-types), it appears in fact that
some words have a much greater importance than in
normal language, as evidenced by a comparison
with the data of the Lessico di Frequenza della
Lingua Italiana Contemporanea (Bortolini et al.,
1971). These are the defining generic terms
such as ACT, EFFECT, PERSON, OBJECT, WHO,
PROCESS, CAUSE, etc. It is not by chance that
these same concepts are of relevance in many
Artificial Intelligence systems.
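The comparison described in a) can be illustrated with a minimal sketch. It assumes two tokenized word lists, one drawn from definitions and one from a general reference corpus; the function names, the smoothing floor, the threshold `min_ratio`, and the toy data are all hypothetical and are not taken from the system described here.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Map each word-type to its share of all word-occurrences."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def overrepresented(definition_tokens, reference_tokens, min_ratio=2.0):
    """Return the words whose relative frequency in definitions is at
    least min_ratio times their relative frequency in the reference."""
    def_freq = relative_frequencies(definition_tokens)
    ref_freq = relative_frequencies(reference_tokens)
    # Words absent from the reference corpus get a small floor value so
    # that the ratio stays finite (an arbitrary smoothing assumption).
    floor = 1.0 / (len(reference_tokens) + 1)
    ratios = {w: f / ref_freq.get(w, floor) for w, f in def_freq.items()}
    return sorted((w for w, r in ratios.items() if r >= min_ratio),
                  key=lambda w: -ratios[w])

# Toy data, invented for illustration only.
definitions = "atto di cantare atto di correre persona che canta".split()
reference = "la persona canta di sera e di notte la casa".split()
print(overrepresented(definitions, reference))
```

On real data the two lists would be the definitional corpus and the frequency lexicon; the ratio threshold would have to be tuned empirically.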
b) Not only single words, or classes of words,
are particularly relevant in the defining
sublanguage. There are also lexical patterns and
syntactic patterns which occur with great
frequency, and which play a very special role in
defining sentences.
The combination of these constraints can be
and actually is very useful, when trying to
exploit the information contained in definitions,
and when transforming an archive of natural
language definitions into a knowledge base
structured as a network. Some important parts of
knowledge are in fact already retrievable in
interactive mode from the Italian Lexical Data
Base, which has recently been restructured.
Analyses on large corpora of definitions,
carried out on many dictionaries (Amsler, 1980;
Calzolari, 1983a, 1983b; Michiels, Noel, 1982)
have in fact shown that the definitions
sublanguage displays several regularities of
lexical and syntactic occurrences and patterns.
These general lexical classes and the classes of
recurrent patterns can be more or less easily
captured, for instance by pattern-matching rules,
and if possible characterized with formal rules.
III HYPONYMY RELATION
Hyponymy is the most important relation to be
evidenced in a lexicon. Due to its taxonomic
nature, it gives the lexicon, when implemented, a
particular hierarchical structure: its result is
obviously not a tree, but many tangled
hierarchies (Amsler, 1980).
Instead of evidencing and labelling this
relation by hand, I have tried to characterize it
procedurally. The procedure which automatically
coded (with a precision of more than 90%
calculated on a random sample of 2000
definitions) true superordinates in all the
definitions (approx. 185,000 for 103,000 lemmas)
was based almost exclusively on the position of
the "genus" term at the beginning of the
definitional phrases, giving Nouns, Verbs and
Adjectives as superordinates of defined entries
of the same lexical category. Ad hoc subroutines
solved exceptional cases where a) quantifiers, or
other modifiers preceded the genus term (e.g.
aletta -> piccolo gruppo di penne dietro
l'angolo dell'ala), or b) more than one genus was
present in the definition (e.g. assordare ->
[...]), or c) a prepositional phrase, usually of
locative type, was at the beginning of the phrase
(e.g. piazzato -> nel rugby, calcio al pallone
collocato sul terreno).
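The positional heuristic and two of its exceptional cases (a leading modifier, a leading locative prepositional phrase) can be sketched as follows. This is only an illustration under stated assumptions: the word lists `MODIFIERS` and `LEADING_PREPOSITIONS` and the function `extract_genus` are hypothetical, the comma is assumed to close the leading prepositional phrase, and the case of multiple genus terms is not handled.

```python
# Hypothetical closed lists; a real procedure would need far larger ones.
LEADING_PREPOSITIONS = {"in", "nel", "nello", "nella", "nei", "negli", "nelle"}
MODIFIERS = {"piccolo", "piccola", "grande", "ogni", "qualsiasi"}

def extract_genus(definition):
    """Guess the genus term as the first content word of a definition."""
    text = definition.strip().lower()
    first_word = text.split()[0] if text else ""
    # A leading prepositional phrase closed by a comma is skipped
    # (assumption: the comma marks the end of the locative phrase).
    if first_word in LEADING_PREPOSITIONS and "," in text:
        text = text.split(",", 1)[1]
    for word in text.split():
        word = word.strip(",;.")
        # Quantifiers and other modifiers before the genus are skipped.
        if word and word not in MODIFIERS:
            return word
    return None

print(extract_genus("piccolo gruppo di penne dietro l'angolo dell'ala"))
print(extract_genus("nel rugby, calcio al pallone collocato sul terreno"))
```

The sketch returns only a word-form; checking that the candidate shares the lexical category of the defined entry would require part-of-speech information, which is omitted here.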
Even though the first immediate purpose of this
procedure is of a classificatory nature, the
ultimate goal is the extraction and formalization
of the most relevant relationship between lexical
items which is implicitly stored in any standard
printed dictionary. It is in fact now possible
to retrieve in the lexical data base not only all
the definitions in which any possible word-form
appears, together with the defined lemmas (e.g.
SUONO appears in 328 definitions), but also to
retrieve on-line, if desired, only the
definitions in which the given word-form is used
as a superordinate, therefore with the list of
its hyponyms (e.g. the same word SUONO is used as
superordinate of only 65 words, i.e. of a subset
of the preceding set containing MUSICA, RUMORE,
SQUILLO, SUSSURRO, etc.).
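The difference between the two kinds of query can be sketched as below. The record layout (`Entry`), the two query functions, and the three toy definitions are assumptions made for illustration; they do not reproduce the query language or the data of the lexical data base.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    lemma: str
    definition: str
    genus: str  # superordinate coded by the extraction procedure

def definitions_containing(entries, word_form):
    """All entries whose definition contains the word-form anywhere."""
    word_form = word_form.lower()
    return [e for e in entries if word_form in e.definition.lower().split()]

def hyponyms_of(entries, word_form):
    """Only the lemmas for which the word-form is the coded superordinate."""
    word_form = word_form.lower()
    return [e.lemma for e in entries if e.genus.lower() == word_form]

# Toy entries with invented definitions, for illustration only.
entries = [
    Entry("musica", "suono organizzato nel tempo", "suono"),
    Entry("squillo", "suono acuto e breve", "suono"),
    Entry("sordina", "dispositivo che attenua il suono", "dispositivo"),
]
print(len(definitions_containing(entries, "SUONO")))
print(hyponyms_of(entries, "SUONO"))
```

The second query always returns a subset of the lemmas found by the first, which mirrors the relation between the 328 definitions and the 65 hyponyms reported for SUONO.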
The query-language so far implemented for the
lexical data base therefore makes it possible to
retrieve information on this hierarchical relation
and its interconnections within the entire
lexicon. The links produced can be analyzed,
evaluated, and, if necessary, interactively
corrected.
From explorations on the trees thus obtained
we can also try to set up classes and subclasses
of superordinates, on the basis of the upper
nodes to which many other nodes are connected as
descendants. Only as an example, the
identification criterion for the noun-class
"SET-OF" containing INSIEME, GRUPPO, COLLEZIONE,
COMPLESSO, AGGREGATO, etc., among the set of
noun-superordinates, is the fact that they are
linked one to the other in the tree which results
from querying the data base. Their hyponyms will
obviously be for the most part collective nouns.
The identification of word-classes like this
one leads to the next step in the formalization
of the hyponymy relation, which will consist in
the attachment of a label indicating a semantic
class to these sets of superordinates. It will
thus be possible to retrieve, for example, all
the nouns generically definable as "SET-OF",
independently of the particular word denoting a
set used in definitions. Since it is already
possible to trace these chains of hyponyms going
upwards or downwards for more than one level, one
can immediately ask whether, for example,
MASSERIA belongs to the set of collectives even
if it is defined as MANDRIA, because MANDRIA is
defined as BRANCO, which is in turn defined as
INSIEME, which finally is one of the nouns
belonging to the class "SET-OF".
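The upward tracing of the MASSERIA example can be sketched as a simple graph traversal. The dictionary `SUPERORDINATES` holds only the three links mentioned in the text; the function `belongs_to_class` and the use of lists of superordinates (to allow tangled hierarchies) are illustrative assumptions, not the implemented procedure.

```python
# Links taken from the example in the text; each lemma may have several
# superordinates, since the hierarchies are tangled rather than trees.
SUPERORDINATES = {
    "masseria": ["mandria"],
    "mandria": ["branco"],
    "branco": ["insieme"],
}
SET_OF = {"insieme", "gruppo", "collezione", "complesso", "aggregato"}

def belongs_to_class(lemma, class_members, superordinates, seen=None):
    """Follow superordinate links upwards until a class member is met."""
    seen = set() if seen is None else seen
    if lemma in class_members:
        return True
    if lemma in seen:  # guard against circular definitions
        return False
    seen.add(lemma)
    return any(belongs_to_class(parent, class_members, superordinates, seen)
               for parent in superordinates.get(lemma, []))

print(belongs_to_class("masseria", SET_OF, SUPERORDINATES))
```

The `seen` set is a design choice of this sketch: dictionary definitions can be circular, and without it the traversal might not terminate.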
Even though some refinements are still
required in order to improve the reliability of
the automatic recovery of ISA-related terms
chains, this kind of structural relation within
the lexicon, that is hyponymy, is at a good stage
of implementation in the Italian lexical data
base.
IV RESTRICTION RELATION
Much still remains to be done as far as other
very interesting relationships between the
entries are concerned. I am now considering what
could be called "restriction or modification"
relation, since its purpose is to restrict or
modify the meaning of the genus term. It is
exemplified in the following definitions by the
words in italics:
stannite -> calcopirite contenente stagno
arricciolare -> modellare a forma di ricciolo
risonatore -> dispositivo atto a generare risonanza
I wish to evaluate what could be done with
respect to this kind of relation, starting from
the available definitional data. One of the
first aims of this lexicological research is to
analyze, by means of computational tools, and to
use the information contained in the different
definitional formats and structures. The
implementation of a number of procedures which
convert the natural language information conveyed
by definitions into processable formats, made up
by structured relational links between lexical
items or classes of lexical items, is now taken
into consideration.
These formats can be made traceable e.g. in an
Information Retrieval system on definitions, like
the one actually implemented, on the entire
corpus, for the taxonomic part of the lexical
structure. But these formatted relational
structures can also be used as starting points
for a computationally exploitable reorganization
of the definitional content. One of the
characteristics of the definitional sublanguage,
i.e. the presence of recurrent patterns (such as
proprio di, relativo a, prodotto da, originario
di, etc.), makes it possible, at least in certain
cases, to produce a constant mapping from certain
variable types of more frequently detected
definitional phrases to constant underlying
relational structures.
Using rather simple pattern-matching
procedures some classes and subclasses of
definitions can be separated, and a small number
of simpler types of definitions have already been
converted into a formalized coded format also
with regard to this restriction relation. A new
data base [...]. The distinguished elements of a
number of simple natural language patterns are
mapped into some general structured information
formats. Up to now, some of the definitions
displaying the following restriction relations
have been treated:
RELFORM (e.g. a forma di)
RELPROV (e.g. provvisto di)
RELAPT (e.g. atto a)
and the corresponding relational links generated.
Among the lexical variants of RELPROV there
are [...] ricco di, etc.; while RELFORM groups the
following variants of a different type: in forma
di, che ha (la) forma (di), di forma, di forma
simile a (quella di), sotto forma di, avente forma
di, etc. It is thus possible, for example, to
retrieve, among the 1271 definitions in which the
word FORMA appears, only those defining something
as "having the shape of something else". The
implementation of these links makes it possible
to produce another kind of partitioning within
the lexical system, and to investigate better the
internal structure of words.
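The mapping from variable surface phrases to a constant relational link can be sketched with regular expressions. The three labels and the listed variants come from the text; the regular expressions themselves, the function `restriction_link`, and the tuple format of the generated link are hypothetical choices made for this illustration.

```python
import re

# One pattern per restriction relation; the variants are those listed
# in the text, the regular expressions are an illustrative assumption.
PATTERNS = [
    ("RELFORM", re.compile(
        r"\b(?:(?:a|in|sotto|avente) forma di"
        r"|che ha (?:la )?forma(?: di)?"
        r"|di forma(?: simile a(?: quella di)?)?)\b")),
    ("RELPROV", re.compile(r"\b(?:provvisto|ricco) di\b")),
    ("RELAPT", re.compile(r"\batto a\b")),
]

def restriction_link(lemma, definition):
    """Return (lemma, relation, genus part, argument part) or None."""
    text = definition.lower()
    for label, pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            genus = text[:match.start()].strip()
            argument = text[match.end():].strip()
            return (lemma, label, genus, argument)
    return None  # no treated relation recognized in this definition

print(restriction_link("arricciolare", "modellare a forma di ricciolo"))
print(restriction_link("risonatore", "dispositivo atto a generare risonanza"))
```

A definition such as "calcopirite contenente stagno" yields None here, because no pattern for that relation is included in the sketch; retrieving only the "shape" definitions among those containing FORMA amounts to filtering on the RELFORM label.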
A procedure of the kind exemplified above,
based on pattern-matching, is possible for a good
number of definition types; for example, with a
different format, for many adjectives:
Adj >> RELX
: VP :
where several groups of definitions are found to
share a common underlying structure in terms of
the restriction relation involved, in spite of
other lexical and syntactic differences.
V FUTURE PERSPECTIVES
A comparison with the definitional corpora of
other dictionaries, also of other languages, will
certainly prove to be useful in establishing the
set of the most general or primitive Relations,
used for definition in lexicographical practice,
often overlapping with the primitive Relations
stated in many AI systems. These relations,
mapped into a formal link in the data base, can
then be paraphrased in each language, in the
standard language.
The data base structure envisaged makes it
possible both to maintain at a lower level (the
starting level), and to eliminate at an upper
level, many peculiarities and variations in the
linguistic relations; their effect is to
facilitate the comprehension by the users of the
printed dictionary, inhibiting however immediate
comprehension by procedural routines in the
mechanical processing of dictionary data.
By applying similar methods of automatic
conversion and mapping into suitable formats, as
extensively as possible throughout the lexicon,
many definitional expressions can be submitted to
an attempt at standardization, thus achieving
greater precision, which gives a considerable
improvement when performing, for example,
information retrieval operations on the content
of a dictionary.
This more structured but, in another sense,
simplified version of definitions, which also
accounts for their relational nature, provides an
excellent basis for testing and studying the
"knowledge of the world" which underlies the
structure of a dictionary.
VI REFERENCES
Alinei, M., La Struttura del Lessico, Bologna: Il
Mulino, 1974.
Amsler, R.A., The Structure of the
Merriam-Webster Pocket Dictionary, Ph.D.
Thesis, Department of Computer Science,
University of Texas, Austin, Texas, 1980.
Bortolini, U., Tagliavini, C., Zampolli, A.,
Lessico di Frequenza della Lingua Italiana
Contemporanea, Milano: Garzanti, 1972.
Calzolari, N., "Towards the organization of
lexical definitions on a data base
structure", COLING82 Abstracts, ed. by E.
Hajičová, Prague: Charles University, 1982,
61-64.
Calzolari, N., "Lexical definitions in a
computerized dictionary", Computers and
Artificial Intelligence, II(1983a)3, 225-233.
Calzolari, N., "Semantic links and the
dictionary", in Proceedings of the 6th
International Conference on Computers and the
Humanities, ed. by S.K. Burton, D.D. Short,
Rockville (Maryland): Computer Science
Press, 1983b, 47-50.
Calzolari, N., Ceccotti, M.L., "Organizing a
large scale lexical database dictionary",
Actes du Congrès Informatique et Sciences
Humaines, Liège: L.A.S.L.A., 1981, 155-163.
verbs", Language, 55(1979)4, 767-811
Evens, M.W., Litowitz, B.E., Markowitz, J.A., Smith, R.N., Werner, O., Lexical-Semantic Relations: a Comparative Survey, Edmonton, Alberta: Linguistic Research Inc., 1980.
Findler, N.V. (ed.), Associative Networks, New York: Academic Press, 1979.
Hendrix, G.G., "Natural-language interface",
Proceedings of the Workshop 'Applied Computational Linguistics in Perspective', American Journal of Computational
Linguistics, 8(198-)-, 56-61.
Michiels, A., Mullenders, J., Noël, J.,
"Exploiting a large data base by Longman", COLING80: Proceedings of the 8th International Conference on Computational Linguistics, Tokyo, 1980, 374-382.
Michiels, A., Noël, J., "Approaches to thesaurus production", COLING82: Proceedings of the
Ninth International Conference on Computational Linguistics, ed. by J. Horecký, Amsterdam: North-Holland, 1982, 227-232.
Nagao, M., Tsujii, J., Ueda, Y., Takiyama, M.,
"An attempt to computerize dictionary data bases", COLING80: Proceedings of the 8th International Conference on Computational Linguistics, Tokyo, 1980, 534-542.
Quillian, M.R., "Semantic memory", in Semantic Information Processing, ed. by M. Minsky,
Cambridge (Mass.): MIT Press, 1968.
Smith, R.N., "On defining adjectives: part III",
Dictionaries, the Journal of the Dictionary
Society of North America, Winter, (1981)5,
28-38.
Smith, R.N., Maxwell, E., "An English dictionary for computerized syntactic and semantic
processing", in Computational and Mathematical Linguistics, ed. by A. Zampolli, N. Calzolari, Firenze: Olschki, 1977, 303-322.
Walker, D.E., Amsler, R.A., Proposal to the
National Science Foundation on an
Invitational Workshop on Machine-Readable
Dictionaries, SRI, 1982 (mimeo).
Zingarelli, N., Vocabolario della lingua
italiana, Bologna: Zanichelli, 1971.