TOWARDS A CORE VOCABULARY FOR A NATURAL LANGUAGE SYSTEM

Hubert Lehmann
IBM Deutschland GmbH, Scientific Center
Institute for Knowledge Based Systems
Wilckensstr. 1a

D-6900 Heidelberg, Germany
Email: LEH at DHDIBM1.BITNET

ABSTRACT

The desire to construct robust and portable natural language systems has led to research on how a core vocabulary for such systems can be defined. Statistical methods and semantic criteria for doing this are discussed and compared. Currently it does not seem possible to precisely define the notion of core vocabulary, but it is argued that workable criteria can nevertheless be found. Finally it is emphasized that the implementation of a core vocabulary must be seen as a long-range research program rather than as a short-term goal.

Motivation

Research on natural language processing systems today strives for the construction of robust and portable systems.¹ A system is robust if it can handle a large variety of user inputs without giving up or producing unexpected results. A system is portable in the sense intended here if it is not geared to a single subject domain, but can be ported with a reasonable effort to a variety of subject domains. It is common understanding that there exists a central fragment of a language which (1) is required for dealing with virtually any subject domain, and (2) is invariant with respect to meaning and use across subject domains. It is of course a non-trivial empirical question whether such a central fragment really exists, and if so, to say what it is, but a number of researchers seem to share the assumption that it does (cf. e.g. Alshawi et al. (1988)). Any robust and portable system would then have to handle this core fragment.

In this paper I am concerned with a second - related - assumption, namely that there exists a core vocabulary which is needed for handling any subject domain. This assumption is also shared by many researchers, and it underlies the production of basic vocabularies for language learning such as Oehler (1980). Usually the authors claim that their word lists are based on statistical investigations, but they also emphasize that they did not slavishly stick to the statistics but used additional criteria such as "usage value", "availability", "familiarity", or "learnability" without ever saying how these are established.²

I will address the following questions:

1. How can the intuitive notion of core vocabulary be properly defined?

2. How can statistical methods be employed to define a core vocabulary, and how do they relate to semantic criteria?

3. What semantic criteria can be found to define a core vocabulary?

Definitions of a core vocabulary

There are several ways to define core vocabulary. I can think of the following three:

1. The core vocabulary consists of the n most frequent words of a language.

2. The core vocabulary is that vocabulary which is common to all native speakers of a language.

3. The semantic core vocabulary consists of those words which suffice to define all of the remaining vocabulary of a language.

The first two definitions call for statistical methods, which shall be discussed in the next section, and the third one obviously requires a semantic approach, which shall be discussed in section "Semantic criteria".

Statistical methods

Frequency counts have well established the basic properties of the frequency distribution for text corpora. Thus in Kucera and Francis (1967) we get coverage figures like this for their complete corpus of about 1 million tokens:

    10 most frequent words:     24.26 %
    100 most frequent words:    47.43 %
    1000 most frequent words:   68.86 %

¹ The research described here has been conducted in the context of the LILOG project (Herzog et al., 1986). It has profited from intensive discussions with R. Mayer. Much of the underlying statistical work on text corpora is due to U. Bandara and G. Walch from the speech recognition project SPRING (Wothke et al., 1989).

² Our investigations are based on German, but for ease of reference also some English examples are given.

These figures vary only slightly with corpus size, and also for German similar values are observed. However, while coverage figures are rather stable with respect to the n most frequent words of a corpus, what are the n most frequent words may vary widely with corpora or subcorpora. Two parameters responsible for this variation are obvious:

1. Subject matter and

2. Communicative function.

Thus in the "Kultur" section of a newspaper which we have analyzed we see that words like Musik, Theater, Regisseur, etc. occur with a drastically higher frequency than in the other sections, which of course can be attributed to subject matter. But personal pronouns, in particular 1st and 2nd person pronouns, also show a much higher frequency, and this can hardly be attributed to subject matter, rather to different communicative functions of feuilletonistic writing and, say, economic news.
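The two observations above (stable coverage of the n most frequent words, but an unstable choice of which words those are) can be illustrated with a minimal sketch. The function names and the two toy subcorpora are assumptions made for illustration; they are not data or code from the study.

```python
from collections import Counter


def top_n(tokens, n):
    """Return the n most frequent word types in a token list."""
    return [word for word, _ in Counter(tokens).most_common(n)]


def coverage(tokens, n):
    """Share of all tokens accounted for by the n most frequent types."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(tokens)


def overlap(tokens_a, tokens_b, n):
    """Fraction of the top-n types of corpus A that are also top-n in B."""
    top_a, top_b = set(top_n(tokens_a, n)), set(top_n(tokens_b, n))
    return len(top_a & top_b) / len(top_a)


# Toy subcorpora (invented for illustration, not taken from the paper).
kultur = "ich sehe die musik und du hörst die musik im theater".split()
wirtschaft = "die bank senkt die zinsen und die bank meldet gewinn".split()

print(coverage(kultur, 2))             # coverage of the 2 most frequent types
print(overlap(kultur, wirtschaft, 2))  # shared part of the two top-2 lists
```

On real corpora one would tokenize and normalize the text first; whitespace splitting is used here only to keep the sketch short.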

All of this relates of course to the much discussed issue of what constitutes a representative corpus for statistical linguistic analysis. Since specific subject matters and communicative functions vary in importance for different speakers of a language, it will be difficult if not impossible to eliminate arbitrariness. Rather, a definition of representative corpus must take into account the research goals pursued.

For a natural language system which is supposed to analyze and generate texts, to engage in dialogues with users, and which is to acquire knowledge from the analysis of definitions and rules formulated in natural language, one needs a corpus of texts where all these aspects are sufficiently represented. We were able to draw upon a variety of corpora none of which would show all the features required, but the combination of them seems to be quite reasonable.

We compared the following five word lists:

1. Oehler (1980): Grundwortschatz consisting of 2247 words,

2. Erk (1972): scientific texts from 34 disciplines, 1283 words with frequency > 20,

3. Pregel/Rickheit (1987): texts by primary school children, 593 words with frequency > 20,

4. SPRING corpus of newspaper texts, 2733 most frequent words,

5. DUDEN (1989): definitions for words beginning with a, 2693 words with frequency > 4.

From these, word lists Bn were formed consisting of those words occurring in at least n of the original word lists (1 ≤ n ≤ 5). The lengths of these lists are B1: 5409, B2: 2248, B3: 1215, B4: 565, and B5: 116.

The size of B5 shows that a really common core of a variety of texts may be extremely small; the successive loosening of restrictions used here allows for a balanced extension of this very small core. The list B3 was chosen as the statistical core vocabulary serving as a base for applying semantic criteria, because the overall core vocabulary was envisaged to have a size of approx. 1500 words. Inspection shows that many intuitively basic words and very few idiosyncratic words are contained, due to the method of intersecting the word lists. Hence, B3 seems quite reasonable.
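The construction of the lists Bn can be sketched as follows. This is a minimal illustration under the assumption that each source word list is available as a collection of normalized word forms; the function name and the five tiny stand-in lists are hypothetical.

```python
from collections import Counter


def at_least_n_lists(word_lists, n):
    """Words that occur in at least n of the given word lists."""
    # Count each word once per list, regardless of repeats inside a list.
    list_counts = Counter(word for words in word_lists for word in set(words))
    return {word for word, k in list_counts.items() if k >= n}


# Five tiny stand-in lists (hypothetical; the real lists hold 593-2733 words).
lists = [
    {"arm", "haus", "und", "teil"},
    {"arm", "und", "analyse"},
    {"arm", "und", "haus", "mama"},
    {"arm", "und", "teil", "regierung"},
    {"arm", "und", "teil", "bedeutung"},
]

b = {n: at_least_n_lists(lists, n) for n in range(1, 6)}
for n in range(1, 6):
    print(f"B{n}: {len(b[n])} words")
```

By construction the lists are nested (B5 is contained in B4, and so on up to B1), so lowering n only ever adds words. B1 is the union of all source lists and B5 is their intersection.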

Semantic criteria

If one takes the n most frequent words of any frequency count one will no doubt discover that these words will not exhibit a linguistic closure in the sense that natural sentences can be formed with all and only the words in the set. Further one will see that semantic relations will be incomplete. Thus one finds in Oehler (1980), which is based on the old Kaeding count, that weiblich (female) occurs but not its antonym männlich (male). For a core vocabulary to be set up for a natural language system, I think, one must strive for linguistic closure, since otherwise one ends up with words one cannot use. This means that you cannot base the core vocabulary on frequency counts alone.
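Gaps of the weiblich/männlich kind can be detected mechanically once a set of relation pairs is available. The following is a minimal sketch; the function name, the toy vocabulary, and the antonym pairs are assumptions for illustration, and where such pairs would come from is left open.

```python
def missing_partners(vocabulary, relation_pairs):
    """Return words in the vocabulary whose related partner is absent."""
    gaps = {}
    for left, right in relation_pairs:
        if left in vocabulary and right not in vocabulary:
            gaps[left] = right
        elif right in vocabulary and left not in vocabulary:
            gaps[right] = left
    return gaps


# Toy data (hypothetical): "weiblich" is present, its antonym is not.
vocabulary = {"weiblich", "alt", "jung"}
antonyms = [("weiblich", "männlich"), ("alt", "jung")]

print(missing_partners(vocabulary, antonyms))  # {'weiblich': 'männlich'}
```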

Furthermore, one cannot expect that one will have just the vocabulary needed to formulate definitions for the words in the list chosen. To avoid circularity, one will have to accept that certain words cannot be defined within the vocabulary, but one will also have to accept that for some words less than complete definitions can be given. Because of this lack of definability, a semantic core vocabulary can only be understood as an approximative notion geared towards "the best one can do". What one can hope to do is to define

1. taxonomic relations,

2. "selectional restrictions" or constraints on semantic compatibility, and

3. meaning rules of arbitrary complexity (including classical definitions).

I propose to formulate all of these types of rules in natural language for B3, trying to stay within at least the vocabulary of B2, to add the words used in the formulations to the original set, and to continue until one cannot think of further rules. I claim that one can achieve a fixed point from where on no new words are added to the set, and that at this moment one has reached a rather good approximation to a semantic core vocabulary.
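The procedure just described amounts to a closure computation. The sketch below makes simplifying assumptions: rules are plain strings keyed by the word they describe, words are compared without lemmatization, and words without a rule are simply left undefined. The function name and the toy rules are hypothetical, except for the arm rule, which uses the English gloss given in this paper.

```python
def close_vocabulary(seed, rule_for):
    """Grow `seed` until the rules for its words introduce no new words.

    `rule_for` maps a word to the natural-language rule text written for it
    (hypothetical data structure; words without a rule are left undefined).
    """
    vocabulary = set(seed)
    pending = list(seed)
    while pending:
        word = pending.pop()
        for used in rule_for.get(word, "").lower().split():
            if used not in vocabulary:
                # A new word entered via a rule; it may need its own rule.
                vocabulary.add(used)
                pending.append(used)
    return vocabulary


# Toy rules (the second one is invented for illustration).
rules = {
    "arm": "every arm is part of a body",
    "body": "every body is a thing",
}

core = close_vocabulary({"arm"}, rules)
print(sorted(core))
```

The loop terminates because each word is added at most once and the rule set is finite. Applying the function again to its own result adds nothing, which corresponds to the fixed point claimed above.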

There is undoubtedly a relationship between frequency and semantic relevance: since taxonomic relations are often exemplified by anaphoric references, since semantic compatibility constraints lead to the co-occurrence of appropriate words, and since other more complex semantic relationships are bound to be exhibited in the various threads of discourse, one has all reason to expect a certain amount of congruence between frequency counts and the semantic core vocabulary as defined above.

The work on formulating taxonomic relations, semantic constraints and other meaning rules is underway, and since it will involve all of the vocabulary, linguistic closure will be achieved at the same time.

As an example, take a taxonomic rule for Arm, which is in B5:

(Every arm is part of a body.)

The word Körperteil (body part) is only available in B1 and was therefore not used; instead of Teil one could also have used Glied (B3, member), but then the rule would not have covered arms of machines or rivers. This highlights a big problem in the natural language formulation of meaning rules: how is ambiguity dealt with? Space does not permit a full discussion here, therefore suffice it to say that it is one of our research goals to formulate meaning rules which specify criteria for disambiguation.

Linguistic description

The preceding discussion has concentrated on how to establish a core vocabulary. Now a few brief remarks shall follow on how the words of the core vocabulary can be linguistically described.

The morphology of languages such as German is well understood and has been coded for an extended vocabulary in the lexical database of the LEX project (Barnett et al., 1986). This database also contains detailed syntactic information, in particular on government patterns.

It is the description of the semantic (and pragmatic) properties of many words one would obviously want to include in a core vocabulary that will confront us with huge unsolved theoretical problems. Be it modal verbs or propositional attitudes, sentence adverbs or "abstract" nouns of various kinds, investigations on some individual words have generated heaps of literature; for others it seems that people have not even dared to look at them. Does this make the enterprise of implementing a core vocabulary a futile one? I think not. I think the implementation of a core vocabulary should be seen as a long-range research goal for both computational and theoretical linguistics, and furthermore that natural language systems provide a good environment for doing experiments in semantics, because they encourage an integrated treatment of linguistic phenomena.

Conclusions

Our research on establishing a core vocabulary for German in the framework of the LILOG project has revealed that currently no absolute definition can be given, but ways have been shown how to arrive at a working definition with respect to the objectives of natural language systems. It has been shown that both statistical methods and semantic criteria can, and I think, have to contribute to the establishment of a core vocabulary.

The linguistic description and thus the implementation of a core vocabulary depends heavily on progress in theoretical linguistics, in particular in semantics and pragmatics, but I want to stress that focussing on a core vocabulary is a fruitful way to direct linguistic research, which can be supported by the need for integrated treatments in natural language systems.

References

Alshawi, H., D. M. Carter, J. van Eijck, R. C. Moore, D. B. Moran, F. C. N. Pereira, and A. G. Smith (1988): "Research Programme in Natural Language Processing - Annual Report", Nattie Project Document NA-16, Cambridge: SRI International.

Barnett, B., H. Lehmann, M. Zoeppritz (1986): "A word database for natural language processing", Proceedings 11th International Conference on Computational Linguistics COLING86, August 25th to 29th, 1986, Bonn, Federal Republic of Germany, 435-440.

Erk, H. (1972): Zur Lexik wissenschaftlicher Fachtexte, München: Hueber.

Herzog, O. et al. (1986): "LILOG - Linguistic and Logic Methods for the Computational Understanding of German", LILOG-Report 1b, Stuttgart: IBM Deutschland.

Kucera, H., W. N. Francis (1967): Computational Analysis of Present-Day American English. Providence, RI: Brown University Press.

Oehler, H. (1980): KLETT Grund- und Aufbauwortschatz Deutsch. Stuttgart: Klett.

Pregel, D., G. Rickheit (1987): Der Wortschatz im Grundschulalter. Hildesheim: Olms.

Wothke, K., U. Bandara, J. Kempf, E. Keppel, K. Mohr, G. Walch (1989): "The SPRING Speech Recognition System for German", in: Proceedings of Eurospeech '89, Vol. 2, 9-12.
