
AUTOMATED DETERMINATION OF SUBLANGUAGE SYNTACTIC USAGE

Ralph Grishman and Ngo Thanh Nhan

Courant Institute of Mathematical Sciences

New York University, New York, NY 10012

Elaine Marsh

Navy Center for Applied Research in Artificial Intelligence

Naval Research Laboratory, Washington, DC 20375

Lynette Hirschman

Research and Development Division, System Development Corporation / A Burroughs Company

Paoli, PA 19301

Abstract

Sublanguages differ from each other, and from the "standard language", in their syntactic, semantic, and discourse properties. Understanding these differences is important if we are to improve our ability to process these sublanguages. We have developed a semi-automatic procedure for identifying sublanguage syntactic usage from a sample of text in the sublanguage. We describe the results of applying this procedure to three text samples: two sets of medical documents and a set of equipment failure messages.

Introduction

A sublanguage is the form of natural language used by a community of specialists in addressing a restricted domain. Sublanguages differ from each other, and from the "standard language", in their syntactic, semantic, and discourse properties. We describe here some recent work on (semi-)automatically determining the syntactic properties of several sublanguages. This work is part of a larger effort aimed at improving the techniques for parsing sublanguages.

If we examine a variety of scientific and technical sublanguages, we will encounter most of the constructs of the standard language, plus a number of syntactic extensions. For example, "report" sublanguages, such as are used in medical summaries and equipment failure summaries, include both full sentences and a number of fragment forms [Marsh 1983]. Specific sublanguages differ in their usage of these syntactic constructs [Kittredge 1982, Lehrberger 1982].

Identifying these differences is important in understanding how sublanguages differ from the language as a whole. It also has immediate practical benefits, since it allows us to trim our grammar to fit the specific sublanguage we are processing. This can significantly speed up the analysis process and block some spurious parses which would be obtained with a grammar of overly broad coverage.

Determining Syntactic Usage

Unfortunately, acquiring the data about syntactic usage can be very tedious, inasmuch as it requires the analysis of hundreds (or even thousands) of sentences for each new sublanguage to be processed. We have therefore chosen to automate this process.

We are fortunate to have available to us a very broad coverage English grammar, the Linguistic String Grammar [Sager 1981], which has been extended to include the sentence fragments of certain medical and equipment failure reports [Marsh 1983]. The grammar consists of a context-free component augmented by procedural restrictions which capture various syntactic and sublanguage semantic constraints. The context-free component is stated in terms of grammatical categories such as noun, tensed verb, and adjective.

To begin the analysis process, a sample corpus is parsed using this grammar. The file of generated parses is reviewed manually to eliminate incorrect parses. The remaining parses are then fed to a program which counts, for each parse tree and cumulatively for the entire file, the number of times that each production in the context-free component of the grammar was applied in building the tree. This yields a "trimmed" context-free grammar for the sublanguage (consisting of those productions used one or more times), along with frequency information on the various productions.
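The counting and trimming step just described can be illustrated with a minimal sketch. The tree encoding, the helper names (`leaf`, `node`, `iter_productions`, `count_productions`, `trim_grammar`) and the toy grammar are all assumptions made for illustration; they are not the format or code of the Linguistic String Parser.

```python
from collections import Counter

# Hypothetical tree encoding (not the Linguistic String Parser's own format):
# a node is (label, children); a lexical category such as N has no children.
def leaf(category):
    return (category, ())

def node(label, *children):
    return (label, children)

def iter_productions(tree):
    """Yield each context-free production (lhs, rhs) applied in building the tree."""
    label, children = tree
    if not children:
        return  # lexical category: no production was applied here
    yield (label, tuple(child[0] for child in children))
    for child in children:
        yield from iter_productions(child)

def count_productions(parses):
    """Return per-tree counts and the cumulative counts over the whole parse file."""
    per_tree = [Counter(iter_productions(tree)) for tree in parses]
    total = Counter()
    for counts in per_tree:
        total.update(counts)
    return per_tree, total

def trim_grammar(full_grammar, total):
    """Keep only the productions used one or more times in the reviewed parses."""
    return [production for production in full_grammar if total[production] > 0]

# Toy broad-coverage grammar (illustrative only).
full_grammar = [
    ("S", ("NP", "VP")), ("S", ("NP",)),          # full sentence, NP fragment
    ("NP", ("N",)), ("NP", ("ADJ", "N")),
    ("VP", ("TV", "NP")), ("VP", ("TV",)),
]

# Two manually reviewed parses: one full sentence and one fragment.
reviewed_parses = [
    node("S", node("NP", leaf("N")),
              node("VP", leaf("TV"), node("NP", leaf("N")))),
    node("S", node("NP", leaf("ADJ"), leaf("N"))),
]

per_tree, total = count_productions(reviewed_parses)
trimmed = trim_grammar(full_grammar, total)
```

In this toy run the production `VP -> TV` is never used, so it is dropped from `trimmed`, while `total` retains the frequency of every production that was used.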

This process was initially applied to text samples from two sublanguages. The first is a set of six patient documents (including patient history, examination, and plan of treatment). The second is a set of electrical equipment failure reports called "CASREPs", a class of operational report used by the U.S. Navy [Froscher 1983]. The parse file for the patient documents had correct parses for 236 sentences (and sentence fragments); the file for the CASREPs had correct parses for 123 sentences. We have recently applied the process to a third text sample, drawn from a sublanguage very similar to the first: a set of five hospital discharge summaries, which include patient histories, examinations, and summaries of the course of treatment in the hospital. This last sample included correct parses for 310 sentences.


Results

The trimmed grammars produced from the three sublanguage text samples were of comparable size. The grammar produced from the first set of patient documents contained 129 non-terminal symbols and 248 productions; the grammar from the second set (the "discharge summaries") was slightly larger, with 134 non-terminals and 282 productions. The grammar for the CASREP sublanguage was slightly smaller, with 124 non-terminals and 220 productions (this is probably a reflection of the smaller size of the CASREP text sample). These figures compare with 255 non-terminal symbols and 744 productions in the "medical records" grammar used by the New York University Linguistic String Project (the "medical records" grammar is the Linguistic String Project English Grammar with extensions for sentence fragments and other, sublanguage specific, constructs, and with a few options deleted).

Figures 1 and 2 show the cumulative growth in the size of the trimmed grammars for the three sublanguages as a function of the number of sentences in the sample. In Figure 1 we plot the number of non-terminal symbols in the grammar as a function of sample size; in Figure 2, the number of productions in the grammar as a function of sample size. Note that the curves for the two medical sublanguages (curves A and B) have pretty much flattened out toward the end, indicating that, by that point, the trimmed grammar covers a very large fraction of the sentences in the sublanguage. (Some of the jumps in the growth curves for the medical grammars reflect the division of the patient documents into sections (history, physical exam, lab tests, etc.) with different syntactic characteristics. For the first few documents, when a new section begins, constructs are encountered which did not appear in prior sections, thus producing a jump in the curve.)
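A growth curve of this kind can be computed by walking through the sample in order and recording how many distinct items have been seen so far. The sketch below is only an illustration: the function name `growth_curve` and the single-letter production labels are hypothetical, and the same function would serve for non-terminal symbols if each sentence were represented by the set of non-terminals in its parse.

```python
def growth_curve(items_per_sentence):
    """Return the cumulative number of distinct items after each sentence.

    items_per_sentence: one set per sentence, in corpus order, holding the
    productions (or non-terminal symbols) used in that sentence's parse.
    """
    seen = set()
    curve = []
    for items in items_per_sentence:
        seen.update(items)       # add anything not encountered before
        curve.append(len(seen))  # Y value for X = number of sentences so far
    return curve

# Hypothetical sample: the last sentence opens a new document section and
# introduces two constructs at once, producing a jump in the curve.
sample = [{"a", "b"}, {"b"}, {"b", "c"}, {"a"}, {"d", "e"}]
curve = growth_curve(sample)
```

A curve that stays flat over many consecutive sentences corresponds to the "flattened out" behaviour noted for the two medical samples.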

The sublanguage grammars are substantially smaller than the full English grammar, reflecting the more limited range of modifiers and complements in these sublanguages. While the full grammar has 67 options for sentence object, the sublanguage grammars have substantially restricted usages: each of the three sublanguage grammars has only 14 object options. Further, the grammars greatly overlap, so that the three grammars combined contain only 20 different object options. While sentential complements of nouns are available in the full grammar, there are no instances of such constructions in either medical sublanguage, and only one instance in the CASREP sublanguage. The range of modifiers is also much restricted in the sublanguage grammars as compared to the full grammar. 15 options for sentential modifiers are available in the full grammar. These are restricted to 9 in the first medical sample, 11 in the second, and 8 in the equipment failure sublanguage. Similarly, the full English grammar has 21 options for right modifiers of nouns; the sublanguage grammars had fewer, 11 in the first medical sample, 10 in the second, and 7 in the CASREP sublanguage. Here the sublanguage grammars overlap almost completely: only 12 different right modifiers of noun are represented in the three grammars combined.
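The overlap comparison above amounts to taking the union and intersection of the option sets that each trimmed grammar allows for a given symbol. The following sketch assumes a grammar stored as a list of `(lhs, rhs)` pairs; the symbol name `OBJECT`, the option names, and the three tiny grammars are invented for illustration and do not reproduce the counts reported in the text.

```python
def options_for(grammar, symbol):
    """Return the set of right-hand sides (options) a grammar allows for symbol."""
    return {rhs for lhs, rhs in grammar if lhs == symbol}

# Hypothetical trimmed grammars, restricted here to the options of one symbol.
trimmed_grammars = {
    "medical_1": [("OBJECT", ("noun_phrase",)), ("OBJECT", ("prep_phrase",)),
                  ("OBJECT", ("null",))],
    "medical_2": [("OBJECT", ("noun_phrase",)), ("OBJECT", ("adjective",)),
                  ("OBJECT", ("null",))],
    "casrep":    [("OBJECT", ("noun_phrase",)), ("OBJECT", ("prep_phrase",)),
                  ("OBJECT", ("infinitive",))],
}

object_options = {name: options_for(grammar, "OBJECT")
                  for name, grammar in trimmed_grammars.items()}

combined = set().union(*object_options.values())       # options in any grammar
shared = set.intersection(*object_options.values())    # options in every grammar
```

A small `combined` set relative to the sum of the individual set sizes indicates heavy overlap, which is the pattern the text reports for object options and right modifiers of nouns.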

Among the options occurring in all the sublanguage grammars, their relative frequency varies according to the domain of the text. For example, the frequency of prepositional phrases as right modifiers of nouns (measured as instances per sentence or sentence fragment) was 0.36 and 0.46 for the two medical samples, as compared to 0.77 for the CASREPs. More striking was the frequency of noun phrases with nouns as modifiers of other nouns: 0.20 and 0.32 for the two medical samples, versus 0.80 for the CASREPs.

We reparsed some of the sentences from the first set of medical documents with the trimmed grammar and, as expected, observed a considerable speed-up. The Linguistic String Parser uses a top-down parsing algorithm with backtracking. Accordingly, for short, simple sentences which require little backtracking there was only a small gain in processing speed (about 25%). For long, complex sentences, however, which require extensive backtracking, the speed-up (by roughly a factor of 3) was approximately proportional to the reduction in the number of productions. In addition, the frequency of bad parses decreased slightly (by <3%) with the trimmed grammar (because some of the bad parses involved syntactic constructs which did not appear in any correct parse in the sublanguage sample).
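To see why a smaller grammar reduces the work of a top-down backtracking parser, consider the toy recognizer below. It is a sketch under stated assumptions, not the Linguistic String Parser: the grammar, the category names, and the functions `expand`, `match_sequence` and `recognise` are hypothetical, the input is a sequence of lexical categories rather than words, and the grammar must be free of left recursion.

```python
def expand(grammar, symbol, tokens, pos, stats):
    """Yield each position reachable after recognising symbol at pos."""
    if symbol not in grammar:  # lexical category: must match the next token
        if pos < len(tokens) and tokens[pos] == symbol:
            yield pos + 1
        return
    for rhs in grammar[symbol]:  # try options in order, backtracking on failure
        stats["tried"] += 1
        yield from match_sequence(grammar, rhs, tokens, pos, stats)

def match_sequence(grammar, rhs, tokens, pos, stats):
    """Yield each end position after recognising every symbol of rhs in turn."""
    if not rhs:
        yield pos
        return
    for middle in expand(grammar, rhs[0], tokens, pos, stats):
        yield from match_sequence(grammar, rhs[1:], tokens, middle, stats)

def recognise(grammar, tokens):
    """Return (accepted, number of production applications attempted)."""
    stats = {"tried": 0}
    accepted = any(end == len(tokens)
                   for end in expand(grammar, "S", tokens, 0, stats))
    return accepted, stats["tried"]

# Toy "full" grammar and a trimmed version keeping only the options in use.
full = {
    "S":  [("NP", "VP")],
    "NP": [("ADJ", "N"), ("N", "PN"), ("N",)],
    "VP": [("TV", "NP", "PN"), ("TV", "NP"), ("TV",)],
    "PN": [("P", "NP")],
}
trimmed = {
    "S":  [("NP", "VP")],
    "NP": [("N",)],
    "VP": [("TV", "NP")],
}

tokens = ["N", "TV", "N"]
ok_full, tried_full = recognise(full, tokens)
ok_trimmed, tried_trimmed = recognise(trimmed, tokens)
```

Both grammars accept the input, but the trimmed grammar attempts fewer productions because the options that only ever led to dead ends are no longer tried. As a rough cross-check against the figures reported above, and assuming the baseline for the reparsing experiment was the 744-production "medical records" grammar, the reduction for the first medical sample is 744 / 248 = 3.0, in line with the roughly threefold speed-up on long sentences.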

Discussion

As natural language interfaces become more mature, their portability - the ability to move an interface to a new domain and sublanguage - is becoming increasingly important. At a minimum, portability requires us to isolate the domain dependent information in a natural language system [Grosz 1983, Grishman 1983]. A more ambitious goal is to provide a discovery procedure for this information - a procedure which can determine the domain dependent information from sample texts in the sublanguage. The techniques described above provide a partial, semi-automatic discovery procedure for the syntactic usages of a sublanguage.* By applying these techniques to a small sublanguage sample, we can adapt a broad-coverage grammar to the syntax of a particular sublanguage. Subsequent text from this sublanguage can then be processed more efficiently.

We are currently extending this work in two directions. For sentences with two or more parses which satisfy both the syntactic and the sublanguage selectional (semantic) constraints, we intend to try using the frequency information gathered for productions to select a parse involving the more frequent syntactic constructs.** Second, we are using a similar approach to develop a discovery procedure for sublanguage selectional patterns. We are collecting, from the same sublanguage samples, statistics on the frequency of co-occurrence of particular sublanguage (semantic) classes in subject-verb-object and host-adjunct relations, and are using this data as input to the grammar's sublanguage selectional restrictions.

* Partial, because it cannot identify new extensions to the base grammar; semi-automatic, because the parses produced with the broad-coverage grammar must be manually reviewed.

** Some small experiments of this type have been done with a Japanese grammar [Nagao 1982] with limited success. Because of the very different nature of the grammar, however, it is not clear whether this has any implications for our experiments.
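The first extension, choosing among competing parses by production frequency, is described only as an intention, and the text does not specify a scoring method. One possible reading is sketched below: each production is scored by the log of its relative frequency among productions with the same left-hand side, and the parse with the highest total is selected. The scoring rule, the fixed penalty for unseen productions, the function names, and the counts are all assumptions for illustration.

```python
import math
from collections import Counter

def production_log_probs(counts):
    """Map each production to log(count / total count for its left-hand side)."""
    lhs_totals = Counter()
    for (lhs, _rhs), n in counts.items():
        lhs_totals[lhs] += n
    return {prod: math.log(n / lhs_totals[prod[0]]) for prod, n in counts.items()}

def score(parse_productions, log_probs, unseen=-10.0):
    """Sum production scores; `unseen` is an arbitrary penalty (an assumption)."""
    return sum(log_probs.get(prod, unseen) for prod in parse_productions)

def select_parse(candidates, log_probs):
    """Return the candidate parse (a list of productions) with the best score."""
    return max(candidates, key=lambda prods: score(prods, log_probs))

# Hypothetical frequency counts gathered from a reviewed sublanguage sample.
counts = Counter({
    ("NP", ("N",)): 8,
    ("NP", ("N", "PN")): 2,
    ("VP", ("TV", "NP")): 6,
    ("VP", ("TV", "NP", "PN")): 4,
})
log_probs = production_log_probs(counts)

# Two parses of one sentence: the prepositional phrase attaches to the noun
# in the first and to the verb in the second.
noun_attach = [("VP", ("TV", "NP")), ("NP", ("N", "PN"))]
verb_attach = [("VP", ("TV", "NP", "PN")), ("NP", ("N",))]
best = select_parse([noun_attach, verb_attach], log_probs)
```

With these invented counts the verb-attachment parse is preferred, since its productions are relatively more frequent in the sample. Whether such a preference improves accuracy on real sublanguage text is exactly what the proposed experiments would need to establish.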

Acknowledgement

This material is based upon work supported by the National Science Foundation under Grants No. MCS-82-02373 and MCS-82-02397.

References

[Froscher 1983] Froscher, J.; Grishman, R.; Bachenko, J.; Marsh, E. "A linguistically motivated approach to automated analysis of military messages." To appear in Proc. 1983 Conf. on Artificial Intelligence, Rochester, MI, April 1983.

[Grishman 1983] Grishman, R.; Hirschman, L.; Friedman, C. "Isolating domain dependencies in natural language interfaces." Proc. Conf. Applied Natural Language Processing, Assn. for Computational Linguistics, 1983.

[Grosz 1983] Grosz, B. "TEAM: a transportable natural-language interface system." Proc. Conf. Applied Natural Language Processing, Assn. for Computational Linguistics, 1983.

[Kittredge 1982] Kittredge, R. "Variation and homogeneity of sublanguages." In Sublanguage: studies of language in restricted semantic domains, ed. R. Kittredge and J. Lehrberger. Berlin & New York: Walter de Gruyter; 1982.

[Lehrberger 1982] Lehrberger, J. "Automatic translation and the concept of sublanguage." In Sublanguage: studies of language in restricted semantic domains, ed. R. Kittredge and J. Lehrberger. Berlin & New York: Walter de Gruyter; 1982.

[Marsh 1983] Marsh, E. "Utilizing domain-specific information for processing compact text." Proc. Conf. Applied Natural Language Processing, 99-103, Assn. for Computational Linguistics, 1983.

[Nagao 1982] Nagao, M.; Nakamura, J. "A parser which learns the application order of rewriting rules." Proc. COLING 82, 253-258.

[Sager 1981] Sager, N. Natural Language Information Processing. Reading, MA: Addison-Wesley; 1981.

Figure 1. Growth in the size of the grammar as a function of the size of the text sample. X = the number of sentences (and sentence fragments) in the text sample; Y = the number of non-terminal symbols in the context-free component of the grammar.

Graph A: first set of patient documents

Graph B: second set of patient documents ("discharge summaries")

Graph C: equipment failure messages

[Plots for graphs A, B and C, each titled "SENTENCES VS NON-TERMINAL SYMBOLS", are not recoverable from the extracted text.]

Figure 2. Growth in the size of the grammar as a function of the size of the text sample. X = the number of sentences (and sentence fragments) in the text sample; Y = the number of productions in the context-free component of the grammar.

Graph A: first set of patient documents

Graph B: second set of patient documents ("discharge summaries")

Graph C: equipment failure messages ("CASREPs")

[Plots for graphs A, B and C, each titled "SENTENCES VS PRODUCTIONS", are not recoverable from the extracted text.]
