DICTIONARY ORGANIZATION FOR MACHINE TRANSLATION: THE EXPERIENCE AND IMPLICATIONS OF THE UMIST JAPANESE PROJECT

Mary McGee Wood, Elaine Pollard, Heather Horsfall, Natsuko Holden, Brian Chandler and Jeremy Carroll

Centre for Computational Linguistics, UMIST, P.O. Box 88, Manchester M60 1QD, U.K.

ABSTRACT

The organization of a dictionary system raises significant questions for all natural language processing applications. We concentrate here on three with specific reference to machine translation: the optimum grain-size for lexical entries, the division of information about separate languages, and the level of abstraction appropriate to the task of translation. These are discussed, and the solutions implemented in the UMIST English-Japanese translation project are described and illustrated in detail.

The importance of the dictionaries in a machine translation system

In any machine translation system, the dictionaries are of critical importance, from (at least) two distinct aspects: their content and their organization. The content of the dictionaries must be adequate in both quantity and quality: that is, the vocabulary coverage must be extensive and appropriately selected (cf. Ritchie 1985), and the translation equivalents carefully chosen (cf. Knowles 1982), if target language output is to be satisfactory or indeed even possible.

The organization of a dictionary system also raises significant questions in translation system design. The information held about lexical items must be stored efficiently, accessed easily in a perspicuous form by the system and by the user, and readily extendable as and when required by the addition either of new lexical entries to a dictionary or of new information to existing entries. In this paper we discuss the way in which these issues have been addressed in the design and implementation of our English-Japanese translation system.

The UMIST Japanese project

At the Centre for Computational Linguistics, we are designing and implementing an English-to-Japanese machine translation system with monolingual English interaction. The project is funded jointly by the Alvey Directorate and International Computers Limited (ICL). The prototype system runs on the ICL PERQ, although much of the development work has been done on a VAX 11/750, a MicroVAX II, and a variety of Sun equipment. It is implemented in Prolog, in the interests of rapid prototyping, but intended for later optimization. For development purposes we are using an existing corpus of 10,000 words of continuous prose from the PERQ's graphics documentation; in the long term, the system will be extended for use by technical writers in fields other than software, and possibly to other languages.

At the time of writing, we have well-developed system development software, user interface, grammar and dictionary handling facilities, including dictionary entry in kanji, and a range of formats for output of linguistic representations and Japanese text. The English analysis grammar handles almost all the syntactic structures of the corpus. The transfer component and Japanese generation grammar currently handle a significant subset of their intended final coverage, and are under rapid development. A facility for interactive resolution of structural ambiguity has been implemented, and the form of its surface presentation is also being refined.

Foundations in linguistic theory

We are committed to active recognition of the mutual benefit of machine translation and linguistic theory, and our system has been designed as an implementation of independently motivated linguistic-theoretic descriptions. The informing principles are those of modern 'lexicalist' unification-based linguistic theories: the English analysis grammar is based on Lexical-Functional Grammar (Bresnan, ed. 1982) and Generalized Phrase Structure Grammar (Gazdar et al. 1985), the Japanese generation grammar on Categorial Grammar (Ades & Steedman 1982, Steedman 1985, Whitelock 1986). These models share a general principle of holding as much information as possible as properties of individual lexical items or as regularities within the lexicon, rather than in a separate component of syntactic grammar rules; our system concurs in this, as will be detailed below.


The demands of translation

Many of the important questions in dictionary design for machine translation are common to all NLP applications. Before describing our actual implementation, we will briefly discuss three issues with specific reference to translation: the optimum grain-size for lexical entries, the division of information about separate languages, and the level of abstraction appropriate to the task of translation.

Firstly, what units should the entries in a machine translation dictionary system describe? In the interests of efficient and accurate translation, one should try to bring together all and only that information which is most likely to be used together. A grouping based on lexical stems of specified category appears to be optimal. Change of verb voice or valency across translation equivalents will not be uncommon. For example, an action with unexpressed agent will normally be described in English with the passive, in French by an active verb with impersonal subject, and in Japanese by an active verb with no expressed subject. Change of lexical category is usually not necessary; when it is, wider structural change is likely to be involved, and is better handled by syntactic than by lexical relations.

Secondly, the optimum organization of multilingual information we take to be the clear separation of source from target languages. Our analysis and generation dictionaries are purely monolingual, with each entry including, not a direct translation equivalent, but a pointer into the transfer dictionary where such correspondences are mapped. For mnemonic reasons these pointers normally take the form of the lexical stem of the translation equivalent or gloss, but this is purely a convenience for the user, and should not obscure their formal nature, or the fact that contrastive information is held only in the transfer dictionaries.

Thirdly, one must consider the level of abstraction appropriate to the task of translation and thus to the components of a machine translation system. Conventionally, in a bilingual transfer system, the transfer dictionaries will whenever possible specify correspondences between actual words of the source and target languages, as is done in our system. (This will be discussed and illustrated below.) However, some interesting points of principle are raised when a system either handles more than two languages or is interlingual in design (the two criteria are of course orthogonal).

It is sometimes suggested, or assumed, that the appropriate base for a machine translation system, perhaps especially an interlingual system, should be language-independent not just in the sense of 'independent of any particular language(s)' but also 'independent of language in general', and 'knowledge-based' translation systems using Schank's 'conceptual dependency' framework (e.g. Schank & Abelson 1977) are presented [...] approach to be misguided. The task of translation is specifically linguistic: the objects which are represented and compared, analysed and generated are texts, linguistic objects. The formal representations built and manipulated in formalized translation should therefore, to be appropriate to the task, also be specifically linguistic (cf. Johnson 1985).

As well as this issue of principle, there are purely practical arguments against the use even of non-language-specific, let alone non-linguistic, representations in machine translation. An interlingual system must (aim to) hold in its 'dictionaries', and/or in the knowledge representation component which supplements or supplants them, any and all information which could in principle ever be needed for translation to or from any language, while the information in a transfer system will be decided on a need-to-know basis given the specific languages involved. Thus for a transfer system the amount of dictionary information needed will be smaller, and the problem of selecting what to include will be more easily and objectively decidable, than for an interlingual system. On this interpretation, it is possible in principle, although complex in practice, to construct a single unified lexicon of mappings among three or more languages which would still properly be classed as a transfer dictionary; and this task would still be simpler than the construction of a satisfactory interlingual 'lexicon'.

Should one take the further step to a fully non-linguistic inter-'lingua', the complications will ramify yet further. It will be necessary to construct not only a fully adequate and genuinely neutral knowledge-base, but also lexically driven access to it, presumably through a more-or-less conventional lexicon, for each language in question, in a way which enables this language-neutral core accurately to map specific lexical equivalents across particular languages.

This is not to deny that a complex and sophisticated semantics is necessary, and some recourse to world-knowledge would be helpful, for the resolution of ambiguities and the determination of correct translation equivalents. We reject only the claim that an appropriate or realistic level of underlying representation for machine translation can be either non-linguistic or language-universal, let alone both at once.

The dictionaries and the user

Given these three underlying design principles - dictionary entries for lexical stems of specified category, strictly monolingual analysis and generation dictionaries, and transfer dictionaries based on language-pair-specific information - we have tried to organize our dictionary system to offer efficient and perspicuous access to both the end-user and the [...] dictionary creation routines for our intended monolingual end user, which elicit and encode the values for a range of features for an open class English word (noun, verb, or adjective - see Whitelock et al. 1986 for details), but which do not ask for translation equivalents in Japanese. This information is sufficient for a parse to continue, with the word in question retained in English or transcribed in katakana in the output (as happens also for proper nouns).

The English entries thus created are stored within the dictionary system in separate '.supp' files, where they are accessible to the parser (thus allowing translation to continue) but clearly isolated for later full update. This will be carried out by the bilingual linguist, who will add an index to the transfer dictionary and create corresponding full entries in the transfer and Japanese dictionaries. At present, during system development, these stages are often run together. In the final version of the system, for monolingual use, the bilingual updates will be supplied by specialist support personnel.

Although this might appear restrictive, it is less so than the alternatives. Given our objective of offering reliable Japanese output to a monolingual English user, we cannot expect that user to carry out full bilingual dictionary update. Equally, we do not wish to constrain the user to operate within the necessarily limited vocabulary of the dictionaries supplied with the system. This organization of information goes some way towards overcoming this dilemma, by enabling the user to extend the available working vocabulary without bilingual knowledge.
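The fallback behaviour described in this section can be illustrated with a minimal Prolog sketch. It is not the project's code: the predicates `supp/2`, `lookup/2` and `target_form/2`, and the sample word `bitmap`, are hypothetical names introduced here, while `edict/2` and `xdict/2` simply imitate the fact formats shown in the paper's listings.

```prolog
% Illustrative sketch only. supp/2, lookup/2, target_form/2 and the word
% 'bitmap' are assumptions; edict/2 and xdict/2 imitate the paper's listings.

% Main English dictionary and transfer dictionary (maintained by linguists).
edict(file, [pred=file, cntype=count]).
xdict(file, fairu).

% Supplementary English entry created by a monolingual user:
% it has features for parsing but no transfer index yet.
supp(bitmap, [pred=bitmap, cntype=count]).

% lookup(+Word, -Features): main dictionary first, then '.supp' entries.
lookup(Word, Features) :- edict(Word, Features), !.
lookup(Word, Features) :- supp(Word, Features).

% target_form(+Word, -Out): use the transfer dictionary when an entry
% exists; otherwise pass the English word through, marked untranslated.
target_form(Word, Out) :-
    lookup(Word, Features),
    memberchk(pred=Pred, Features),
    (   xdict(Pred, Japanese)
    ->  Out = translated(Japanese)
    ;   Out = untranslated(Word)
    ).
```

Under these assumptions, `?- target_form(file, Out).` gives `Out = translated(fairu)`, while `?- target_form(bitmap, Out).` gives `Out = untranslated(bitmap)`, so analysis can continue before the bilingual update has been made.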

The dictionaries, the user, and the system

The dictionary creation routines, whether in monolingual mode for the end user or in bilingual mode for the linguist, build 'neutral form' dictionary entries consisting of a simple list of features and values. Regular inflected forms are supplied dynamically during dictionary creation and lookup, by running the morphological analyser in reverse. All atomic feature values are listed explicitly. This ensures that all the information held about each word is clearly available to the user. The compilation process for these neutral forms is so designed that values for a new feature can be added throughout without totally rebuilding the dictionary file in question.

ENTRIES FROM DICTIONARY CREATION

nf([word=trees,stem=tree,stemtyp=noun,cntype=count,plural=[]])
nf([word=live,stem=live,stemtyp=verb,thirdsing=[],pres_part=[],past=[],past_part=[]])
nf([stem=difficult,stemtyp=adj,adverb=[],forms_comp=no])

The neutral form entries are automatically compiled into 'program form' entries in the format expected by the parser. The program form entries are kept as small as possible, firstly by storing only irregular inflected forms, as in the neutral form entries described above. Secondly, we factor out predictable atomic feature values into feature co-occurrence restrictions. These derive largely from the FCRs of Generalized Phrase Structure Grammar (Gazdar et al. 1984), which are in fact classical redundancy rules as in Chomsky (1965), Chomsky & Halle (1968).

FEATURE SETS

featset(daughters,[subj,obj,obj2,pcomp,vcomp,ecomp,scomp, ...])
featset(roles,[arg1,arg0,arg2,adjunct,compound, ...])

FEATURE CO-OCCURRENCE RESTRICTIONS

fcr(inf=_,[fin=nonfin])
fcr(tense=_,[fin=finite,stemtyp=verb])
fcr(fin=_,[cat=verb])
jfcr(noun=yes,[verb=no,adnom=no,tensed=no])
jfcr(adj=yes,[adverb=no,adnom=no,tensed=no])

This is one possible implementation of the 'virtual lexicon' strategy proposed by Church 1980, and widely used since. A similar technique is used in the LRC Metal system (Slocum & Bennett 1982). The use of defaults in dictionary design for machine translation, or natural language processing in general, is a complex issue which lies beyond the scope of the present paper.

Thus the maximum load is given to generalized lexical redundancy patterns rather than to individual lexical entries. However, this is not 'procedural' as opposed to 'declarative'. It is simply a declarative statement in which the maximum number of regularities are stated explicitly as such.
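To make the role of the redundancy rules concrete, the following Prolog sketch shows one way a compact entry could be expanded by feature co-occurrence restrictions. The `fcr/2` facts follow the FEATURE CO-OCCURRENCE RESTRICTIONS listing above; the predicate `expand/2` and its fixpoint strategy are assumptions made for illustration and are not taken from the UMIST system. The sketch also performs no consistency checking, so conflicting values would simply coexist.

```prolog
% Illustrative sketch only: expand/2 is a hypothetical predicate.
% The fcr/2 facts follow the paper's listing.
fcr(inf=_,   [fin=nonfin]).
fcr(tense=_, [fin=finite, stemtyp=verb]).
fcr(fin=_,   [cat=verb]).

% expand(+Features, -Full): add every feature value implied by an FCR
% whose trigger is present, repeating until nothing new can be added.
expand(Features, Full) :-
    fcr(Trigger, Implied),
    memberchk(Trigger, Features),    % the rule applies to this entry
    member(New, Implied),
    \+ memberchk(New, Features),     % and contributes something new
    !,
    expand([New|Features], Full).
expand(Features, Features).          % fixpoint reached
```

For example, `?- expand([pred=put, tense=past], Full).` yields `Full = [cat=verb, stemtyp=verb, fin=finite, pred=put, tense=past]`: the stored entry needs to state only the tense, and the rules supply the predictable values.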

This two-layered dictionary structure and automatic compilation ensures that any change in the parser which implicates its dictionary format requires at most a recompilation from the neutral form rather than labour-intensive rewriting. It also makes dictionary information available both in a form perspicuous to the human user and, independently, in a form optimally adapted to the design of the parser.
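The compilation step from neutral form to program form might look roughly as follows. This is a sketch under several assumptions: that an empty slot (`[]`) in a neutral form entry means "regular, nothing to store", that an irregular form is recorded in the slot itself, and that the slots map to features as in `slot_features/2`. The predicates `compile_nf/2` and `slot_features/2` are hypothetical, and only verb entries are covered.

```prolog
% Illustrative sketch only: compile_nf/2 and slot_features/2 are
% hypothetical, as is the reading of [] as "regular form, not stored".

% slot_features(+Slot, -Features): features carried by an irregular form
% recorded in the given neutral-form slot.
slot_features(past,      [tense=past]).
slot_features(past_part, [nfform=en]).

% compile_nf(+NeutralForm, -ProgramForm): one edict/2 term for the stem,
% plus one irreg/2 term per non-empty irregular slot.
compile_nf(nf(Pairs), [edict(Stem, [pred=Stem, stemtyp=Type]) | Irregs]) :-
    memberchk(stem=Stem, Pairs),
    memberchk(stemtyp=Type, Pairs),
    findall(irreg(Form, [pred=Stem|Fs]),
            ( member(Slot=Form, Pairs),
              slot_features(Slot, Fs),
              Form \== [] ),          % regular slots produce nothing
            Irregs).
```

With a hypothetical neutral form `nf([stem=put,stemtyp=verb,thirdsing=[],pres_part=[],past=put,past_part=put])`, the result is `[edict(put,[pred=put,stemtyp=verb]), irreg(put,[pred=put,tense=past]), irreg(put,[pred=put,nfform=en])]`, the same shape as the program-form entries for `put` in the paper's examples. A change to the parser's format would then only require changing `compile_nf/2` and recompiling.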

The dictionaries and the system

The program form dictionaries factor out different types of information to be invoked at different stages in parsing and interpretation of English input. In the first stage, grammatical category and morphological and semantic-feature information is looked up in 'edict' dictionaries.

NOUN
edict(file,[pred=file,cntype=count])
edict(information,[pred=information,cntype=mass])
edict(manual,[pred=manual_book,cntype=count])
edict(storage,[pred=storage,cntype=mass])

VERB
edict(consist,[pred=consist,stemtyp=verb])
edict(correspond,[pred=correspond,stemtyp=verb])
edict(provide,[pred=provide,stemtyp=verb])
edict(put,[pred=put,stemtyp=verb])
irreg(put,[pred=put,tense=past])
irreg(put,[pred=put,nfform=en])
edict(be,[pred=be,block=[1,1,1,0,1,1,1]])
irreg(are,[pred=be,tense=pres,subj/agrpl=yes])
irreg(been,[pred=be,nfform=en])
irreg(is,[pred=be,tense=pres,subj/agrpl=no])
irreg(was,[pred=be,tense=past,subj/agrpl=no])
irreg(were,[pred=be,tense=past,subj/agrpl=yes])
edict(become,[pred=become,stemtyp=verb])
irreg(became,[pred=become,tense=past])
irreg(become,[pred=become,nfform=en])

ADJ
edict(graphical,[pred=graphical,stemtyp=adj])
edict(manual,[pred=manual_hand,stemtyp=adj])

DET
stop(the,det,[spec=def])
stop(a,det,[spec=indef,agrpl=no,artpl=no])
stop(many,det,[quan=many,agrpl=yes])
stop(much,det,[quan=much,agrpl=no])
stop(some,det,[spec=indef,artpl=yes])

This information is used in parsing to produce LFG-ish functional structures. Optional and obligatory subcategorization features are then looked up in separate 'subcat' dictionaries.

EXAMPLES FROM SUBCAT, PROVIDING A SUBCATEGORIZATION FRAME

subcat(consist,[intrans,ofarg,loc])
oblig(consist,[arg1])
subcat(correspond,[intrans,toarg,loc])
subcat(provide,[trans,forben,loc])
subcat(put,[trans,locgoal])
oblig(put,[arg0,arg2])
subcat(be,[predadj,aux],predadj)
subcat(be,[pass,aux],passive)
subcat(be,[prog,aux],prog)
subcat(be,[exist,objess],be_exist)
subcat(be,[intrans,objess])
subcat(become,[intrans,objess,loc])

Using this additional information, the functional structures can go through function-argument mapping to produce semantic structures for those which are valid. The transfer component consists solely of a dictionary of mappings between source and target language lexical items, or, where necessary (e.g. for idioms), more complex quasi-syntactic configurations.

EXAMPLES FROM TRANSFER DICTIONARY

NOUNS
xdict(file,fairu)
xdict(information,zyouhou)
xdict(manual_book,manyuaru)
xdict(storage,kiokusouti)

VERBS
xdict(be_exist,a,[vmorph=aru])
xdict(become,na,[gloss=become])
xdict(consist,na,[gloss=consist])
xdict(provide,sonae)

ADJECTIVES
xdict(graphical,gurafikku)
xdict(manual_hand,syudou)

Japanese generation proceeds through an inverse sequence.

EXAMPLES FROM JAPANESE DICTIONARIES

NOUN
jdict(fairu,[pred=fairu,kform=kata,gloss=file,stemtyp=noun])
jdict(jouhou,[pred=jouhou,kform='[kanji]',gloss=information,stemtyp=noun])
jdict(kiokusouti,[pred=kiokusouti,kform='[kanji]',gloss=storage,stemtyp=noun])
jdict(manyuaru,[pred=manyuaru,kform=kata,gloss=manual,stemtyp=noun])
jdict(syudou,[pred=syudou,kform='[kanji]',gloss=manual,stemtyp=noun])
jdict(gurafikku,[pred=gurafikku,kform=kata,gloss=graphical,stemtyp=noun])

U-VERB
jdict(i,[pred=i,vmorph=1-i,kform=hira,gloss=be,stemtyp=uverb])
jdict(ire,[pred=ire,vmorph=1-e,kform='[kanji]',gloss=put,stemtyp=uverb])
jdict(na,[pred=na,vmorph=5-r,kform='[kanji]',gloss=become,stemtyp=uverb])
jdict(na,[pred=na,vmorph=5-r,kform='[kanji]',gloss=consist,stemtyp=uverb])
jdict(sonae,[pred=sonae,vmorph=1-e,kform='[kanji]',gloss=provide,stemtyp=uverb,tensem=punct])
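The way the three dictionaries fit together can be traced with a short Prolog sketch. The facts are taken from the examples in this section (with the `kform` feature left out of the `syudou` entry, since the kanji is not legible in this copy); the predicate `transfer/4` is a hypothetical helper introduced here and is not part of the system described.

```prolog
% Illustrative sketch only: transfer/4 is a hypothetical predicate.
% Facts follow the paper's examples; kform is omitted for syudou.

% Monolingual English analysis entries (two readings of 'manual').
edict(manual, [pred=manual_book, cntype=count]).
edict(manual, [pred=manual_hand, stemtyp=adj]).

% Transfer dictionary: the only place where the two languages meet.
xdict(manual_book, manyuaru).
xdict(manual_hand, syudou).

% Monolingual Japanese generation entries.
jdict(manyuaru, [pred=manyuaru, kform=kata, gloss=manual, stemtyp=noun]).
jdict(syudou,   [pred=syudou, gloss=manual, stemtyp=noun]).

% transfer(+Word, -Pred, -JStem, -JFeatures): follow the pointer from an
% English entry through the transfer dictionary to a Japanese entry.
transfer(Word, Pred, JStem, JFeatures) :-
    edict(Word, EFeatures),
    memberchk(pred=Pred, EFeatures),   % pointer into the transfer dictionary
    xdict(Pred, JStem),
    jdict(JStem, JFeatures).
```

The query `?- transfer(manual, Pred, JStem, _).` returns `Pred = manual_book, JStem = manyuaru` and, on backtracking, `Pred = manual_hand, JStem = syudou`. The English and Japanese entries never mention each other directly; choosing between the two readings is left to the analysis, which this sketch does not model.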

Conclusions

The organization of the dictionaries in a machine translation system raises a number of significant issues, some general to natural language processing and others specific to translation. In the course of implementing our English-Japanese system, we have arrived at one possible set of answers to these questions, which we hope to have shown are both computationally practicable and of wider theoretical interest.

ACKNOWLEDGEMENTS

The work on which this paper is based is supported by International Computers Limited (ICL) and by the UK Science and Engineering Research Council under the Alvey programme for research in Intelligent Knowledge Based Systems. We are indebted to our present and former colleagues for their tolerance and support, especially to Pete Whitelock and Rod Johnson.

REFERENCES

Ades, Antony, & Mark Steedman. 1982. On the Order of Words. Linguistics and Philosophy.

Bresnan, Joan, ed. 1982. The Mental Representation of Grammatical Relations. MIT Press, Cambridge, Mass.

Chomsky, Noam. 1965. Aspects of the Theory of Syntax. MIT Press, Cambridge, Mass.

Chomsky, Noam, & Morris Halle. 1968. The Sound Pattern of English. Harper & Row, New York.

Church, Kenneth. 1980. On Memory Limitations in Natural Language Processing. MIT Report MIT/LCS/TR-245.

Gazdar, Gerald, Ewan Klein, Geoff Pullum, & Ivan Sag. 1984. Generalized Phrase Structure Grammar. Blackwells, Oxford.

Johnson, R. L. 1985. Translation. In Whitelock et al., eds.

Knowles, Francis. 1982. The Pivotal Role of the Dictionaries in a Machine Translation System. In Lawson, Veronica, ed. Practical Experience of Machine Translation. North-Holland.

Nirenberg, Sergei. 1986. Machine Translation. Tutorial Introduction, ACL 1986, New York.

Ritchie, Graeme. 1985. The Lexicon. In Whitelock et al., eds.

Schank, Roger, & Robert Abelson. 1977. Scripts, Plans, Goals and Understanding. Erlbaum.

Slocum, Jonathan, & W. S. Bennett. 1982. The LRC Machine Translation System. Working Paper LRC-82-1, LRC, University of Texas, Austin.

Steedman, Mark. 1985. Dependency and Coordination in the Grammar of Dutch and English. Language.

Whitelock, Peter. 1986. A Categorial-like Morpho-syntax for Japanese.

Whitelock, Peter, Mary McGee Wood, Brian Chandler, Natsuko Holden, & Heather Horsfall. 1986. Strategies for Interactive Machine Translation. Proceedings of Coling86.

Whitelock, Peter, Mary McGee Wood, Harold Somers, R. L. Johnson, & Paul Bennett, eds. Forthcoming. Linguistic Theory and Computer Applications. Academic Press, London.
