AUTOMATED DETERMINATION OF SUBLANGUAGE SYNTACTIC USAGE
Ralph Grishman and Ngo Thanh Nhan
Courant Institute of Mathematical Sciences
New York University, New York, NY 10012
Elaine Marsh
Navy Center for Applied Research in Artificial Intelligence
Naval Research Laboratory, Washington, DC 20375
Lynette Hirschman
Research and Development Division, System Development Corporation / A Burroughs Company
Paoli, PA 19301
Abstract
Sublanguages differ from each other, and from the "standard
language," in their syntactic, semantic, and discourse
properties. Understanding these differences is important if
we are to improve our ability to process these sublanguages.
We have developed a semi-automatic procedure for identifying
sublanguage syntactic usage from a sample of text in the
sublanguage. We describe the results of applying this
procedure to three text samples: two sets of medical
documents and a set of equipment failure messages.
Introduction
A sublanguage is the form of natural language used by a
community of specialists in addressing a restricted domain.
Sublanguages differ from each other, and from the "standard
language," in their syntactic, semantic, and discourse
properties. We describe here some recent work on
(semi-)automatically determining the syntactic properties of
several sublanguages. This work is part of a larger effort
aimed at improving the techniques for parsing sublanguages.
If we examine a variety of scientific and technical
sublanguages, we will encounter most of the constructs of
the standard language, plus a number of syntactic
extensions. For example, "report" sublanguages, such as are
used in medical summaries and equipment failure summaries,
include both full sentences and a number of fragment forms
[Marsh 1983]. Specific sublanguages differ in their usage of
these syntactic constructs [Kittredge 1982, Lehrberger 1982].

Identifying these differences is important in understanding
how sublanguages differ from the language as a whole. It
also has immediate practical benefits, since it allows us to
trim our grammar to fit the specific sublanguage we are
processing. This can significantly speed up the analysis
process and block some spurious parses which would be
obtained with a grammar of overly broad coverage.
Determining Syntactic Usage
Unfortunately, acquiring the data about syntactic usage can
be very tedious, inasmuch as it requires the analysis of
hundreds (or even thousands) of sentences for each new
sublanguage to be processed. We have therefore chosen to
automate this process.

We are fortunate to have available to us a very broad
coverage English grammar, the Linguistic String Grammar
[Sager 1981], which has been extended to include the
sentence fragments of certain medical and equipment failure
reports [Marsh 1983]. The grammar consists of a context-free
component augmented by procedural restrictions which capture
various syntactic and sublanguage semantic constraints. The
context-free component is stated in terms of grammatical
categories such as noun, tensed verb, and adjective.
To begin the analysis process, a sample corpus is parsed
using this grammar. The file of generated parses is reviewed
manually to eliminate incorrect parses. The remaining parses
are then fed to a program which counts, for each parse tree
and cumulatively for the entire file, the number of times
that each production in the context-free component of the
grammar was applied in building the tree. This yields a
"trimmed" context-free grammar for the sublanguage
(consisting of those productions used one or more times),
along with frequency information on the various productions.
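
The counting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used in the study: the tree representation (a `(label, children)` pair, with a lexical leaf written as `(category, word)`), the function names, and the category labels such as `SENTENCE` and `FRAGMENT` are all hypothetical and are not taken from the Linguistic String Grammar.

```python
from collections import Counter

# Assumed representation: a node is (label, children); a lexical leaf is
# (category, word), i.e. its second element is a plain string.

def productions(tree):
    """Yield (lhs, rhs) for each context-free production used in the tree."""
    label, children = tree
    if isinstance(children, str):       # lexical leaf: no production to count
        return
    yield (label, tuple(child[0] for child in children))
    for child in children:
        yield from productions(child)

def count_productions(parses):
    """Return per-tree counts and the cumulative counts for the whole file."""
    total = Counter()
    per_tree = []
    for tree in parses:
        counts = Counter(productions(tree))
        per_tree.append(counts)
        total.update(counts)
    return per_tree, total

def trimmed_grammar(total):
    """Productions used one or more times in the (manually checked) parses."""
    return {prod for prod, n in total.items() if n >= 1}

# Toy parses with made-up labels, standing in for a reviewed parse file.
t1 = ("SENTENCE", [("SUBJECT", [("N", "patient")]), ("VERB", [("TV", "improved")])])
t2 = ("SENTENCE", [("SUBJECT", [("N", "pump")]), ("VERB", [("TV", "failed")])])
t3 = ("FRAGMENT", [("N", "fever")])
parses = [t1, t2, t3]

per_tree, total = count_productions(parses)
for (lhs, rhs), n in sorted(total.items()):
    print(f"{lhs} -> {' '.join(rhs)}: {n}")
```

The cumulative counter doubles as the frequency information mentioned in the text, and the trimmed grammar is simply the set of its keys.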
This process was initially applied to text samples from two
sublanguages. The first is a set of six patient documents
(including patient history, examination, and plan of
treatment). The second is a set of electrical equipment
failure reports called "CASREPs", a class of operational
report used by the U.S. Navy [Froscher 1983]. The parse file
for the patient documents had correct parses for 236
sentences (and sentence fragments); the file for the CASREPs
had correct parses for 123 sentences. We have recently
applied the process to a third text sample, drawn from a
sublanguage very similar to the first: a set of five
hospital discharge summaries, which include patient
histories, examinations, and summaries of the course of
treatment in the hospital. This last sample included correct
parses for 310 sentences.
Results
The trimmed grammars produced from the three sublanguage
text samples were of comparable size. The grammar produced
from the first set of patient documents contained 129
non-terminal symbols and 248 productions; the grammar from
the second set (the "discharge summaries") was slightly
larger, with 134 non-terminals and 282 productions. The
grammar for the CASREP sublanguage was slightly smaller,
with 124 non-terminals and 220 productions (this is probably
a reflection of the smaller size of the CASREP text sample).
These figures compare with 255 non-terminal symbols and 744
productions in the "medical records" grammar used by the New
York University Linguistic String Project (the "medical
records" grammar is the Linguistic String Project English
Grammar with extensions for sentence fragments and other,
sublanguage specific, constructs, and with a few options
deleted).
Figures 1 and 2 show the cumulative growth in the size of
the trimmed grammars for the three sublanguages as a
function of the number of sentences in the sample. In
Figure 1 we plot the number of non-terminal symbols in the
grammar as a function of sample size; in Figure 2, the
number of productions in the grammar as a function of sample
size. Note that the curves for the two medical sublanguages
(curves A and B) have pretty much flattened out toward the
end, indicating that, by that point, the trimmed grammar
covers a very large fraction of the sentences in the
sublanguage. (Some of the jumps in the growth curves for the
medical grammars reflect the division of the patient
documents into sections (history, physical exam, lab tests,
etc.) with different syntactic characteristics. For the
first few documents, when a new section begins, constructs
are encountered which did not appear in prior sections, thus
producing a jump in the curve.)
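
As a rough illustration of how such growth curves can be derived, the sketch below accumulates the distinct non-terminals and productions seen after each sentence. It assumes each sentence's correct parse has already been reduced to a list of `(lhs, rhs)` productions, and it counts a non-terminal once it appears on the left-hand side of some production. The function name and the toy data are hypothetical.

```python
def growth_curves(per_sentence_productions):
    """Return (n_sentences, n_nonterminals, n_productions) after each sentence."""
    seen_productions = set()
    seen_nonterminals = set()
    curve = []
    for i, prods in enumerate(per_sentence_productions, start=1):
        for lhs, rhs in prods:
            seen_productions.add((lhs, rhs))
            seen_nonterminals.add(lhs)   # assumption: count left-hand sides only
        curve.append((i, len(seen_nonterminals), len(seen_productions)))
    return curve

# Toy sample: the second sentence introduces one new production,
# the third introduces nothing new, so the curve flattens.
sample = [
    [("S", ("NP", "VP")), ("NP", ("N",)), ("VP", ("TV",))],
    [("S", ("NP", "VP")), ("NP", ("N",)), ("VP", ("TV", "NP"))],
    [("S", ("NP", "VP")), ("NP", ("N",)), ("VP", ("TV",))],
]
curve = growth_curves(sample)
print(curve)
```

A flat tail in the second and third columns corresponds to the flattening the text notes for curves A and B.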
The sublanguage grammars are substantially smaller than the
full English grammar, reflecting the more limited range of
modifiers and complements in these sublanguages. While the
full grammar has 67 options for sentence object, the
sublanguage grammars have substantially restricted usages:
each of the three sublanguage grammars has only 14 object
options. Further, the grammars greatly overlap, so that the
three grammars combined contain only 20 different object
options. While sentential complements of nouns are available
in the full grammar, there are no instances of such
constructions in either medical sublanguage, and only one
instance in the CASREP sublanguage. The range of modifiers
is also much restricted in the sublanguage grammars as
compared to the full grammar. 15 options for sentential
modifiers are available in the full grammar. These are
restricted to 9 in the first medical sample, 11 in the
second, and 8 in the equipment failure sublanguage.
Similarly, the full English grammar has 21 options for right
modifiers of nouns; the sublanguage grammars had fewer, 11
in the first medical sample, 10 in the second, and 7 in the
CASREP sublanguage. Here the sublanguage grammars overlap
almost completely: only 12 different right modifiers of noun
are represented in the three grammars combined.
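
The overlap comparison above amounts to set operations on the options (right-hand sides) that each trimmed grammar retains for a given symbol. The following sketch shows one way to compute it, assuming each trimmed grammar is available as a set of `(lhs, rhs)` productions. The grammar names, symbols, and option contents are invented toy data, not the figures reported in the text.

```python
def options(grammar, symbol):
    """Right-hand sides (options) that a grammar retains for one symbol."""
    return {rhs for lhs, rhs in grammar if lhs == symbol}

def compare_options(grammars, symbol):
    """Per-grammar options, their union, and the options shared by all.

    Assumes `grammars` is a non-empty dict mapping a name to a production set.
    """
    per_grammar = {name: options(g, symbol) for name, g in grammars.items()}
    union = set().union(*per_grammar.values())
    common = set.intersection(*per_grammar.values())
    return per_grammar, union, common

# Toy trimmed grammars with hypothetical symbols and options.
grammars = {
    "medical_1": {("OBJECT", ("NP",)), ("OBJECT", ("PP",)), ("SUBJECT", ("NP",))},
    "casrep": {("OBJECT", ("NP",)), ("OBJECT", ("ADJ",))},
}
per_grammar, union, common = compare_options(grammars, "OBJECT")
print({name: len(opts) for name, opts in per_grammar.items()}, len(union), len(common))
```

A union only slightly larger than each individual set, as with the 14 versus 20 object options in the text, indicates heavy overlap between the sublanguages.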
Among the options occurring in all the sublanguage grammars,
their relative frequency varies according to the domain of
the text. For example, the frequency of prepositional
phrases as right modifiers of nouns (measured as instances
per sentence or sentence fragment) was 0.36 and 0.46 for the
two medical samples, as compared to 0.77 for the CASREPs.
More striking was the frequency of noun phrases with nouns
as modifiers of other nouns: 0.20 and 0.32 for the two
medical samples, versus 0.80 for the CASREPs.
We reparsed some of the sentences from the first set of
medical documents with the trimmed grammar and, as expected,
observed a considerable speed-up. The Linguistic String
Parser uses a top-down parsing algorithm with backtracking.
Accordingly, for short, simple sentences which require
little backtracking there was only a small gain in
processing speed (about 25%). For long, complex sentences,
however, which require extensive backtracking, the speed-up
(by roughly a factor of 3) was approximately proportional to
the reduction in the number of productions. In addition, the
frequency of bad parses decreased slightly (by <3%) with the
trimmed grammar (because some of the bad parses involved
syntactic constructs which did not appear in any correct
parse in the sublanguage sample).
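
The proportionality claim can be checked against the production counts reported earlier. The short calculation below assumes that the baseline for the reparsing comparison was the 744-production "medical records" grammar; the text does not state this explicitly, so treat the baseline as an assumption. The variable names are ours.

```python
# Production counts as reported in the Results section.
full_productions = 744              # "medical records" grammar (assumed baseline)
trimmed_productions = {
    "patient documents": 248,
    "discharge summaries": 282,
    "CASREPs": 220,
}

# Reduction factor = full grammar size / trimmed grammar size.
ratios = {name: full_productions / n for name, n in trimmed_productions.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

For the first set of patient documents the ratio is 744 / 248 = 3.0, in line with the speed-up of roughly a factor of 3 observed for long sentences.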
Discussion
As natural language interfaces become more mature, their
portability - the ability to move an interface to a new
domain and sublanguage - is becoming increasingly important.
At a minimum, portability requires us to isolate the domain
dependent information in a natural language system [Grosz
1983, Grishman 1983]. A more ambitious goal is to provide a
discovery procedure for this information - a procedure which
can determine the domain dependent information from sample
texts in the sublanguage. The techniques described above
provide a partial, semi-automatic discovery procedure for
the syntactic usages of a sublanguage.* By applying these
techniques to a small sublanguage sample, we can adapt a
broad-coverage grammar to the syntax of a particular
sublanguage. Subsequent text from this sublanguage can then
be processed more efficiently.
We are currently extending this work in two directions. For
sentences with two or more parses which satisfy both the
syntactic and the sublanguage selectional (semantic)
constraints, we intend to try using the frequency
information gathered for productions to select a parse
involving the more frequent syntactic constructs.** Second,
we are using a similar approach to develop a discovery
procedure for sublanguage selectional patterns. We are
collecting, from the same sublanguage samples, statistics on
the frequency of co-occurrence of particular sublanguage
(semantic) classes in subject-verb-object and host-adjunct
relations, and are using this data as input to the grammar's
sublanguage selectional restrictions.

* Partial, because it cannot identify new extensions to the
base grammar; semi-automatic, because the parses produced
with the broad-coverage grammar must be manually reviewed.

** Some small experiments of this type have been done with a
Japanese grammar [Nagao 1982] with limited success. Because
of the very different nature of the grammar, however, it is
not clear whether this has any implications for our
experiments.
Acknowledgement
This material is based upon work supported by the National
Science Foundation under Grants No. MCS-82-02373 and
MCS-82-02397.
References
[Froscher 1983] Froscher, J.; Grishman, R.; Bachenko, J.; Marsh, E. "A linguistically motivated approach to automated analysis of military messages." To appear in Proc. 1983 Conf. on Artificial Intelligence, Rochester, MI, April 1983.
[Grishman 1983] Grishman, R.; Hirschman, L.; Friedman, C. "Isolating domain dependencies in natural language interfaces." Proc. Conf. Applied Natural Language Processing, Assn. for Computational Linguistics, 1983.
[Grosz 1983] Grosz, B. "TEAM: a transportable natural-language interface system." Proc. Conf. Applied Natural Language Processing, Assn. for Computational Linguistics, 1983.
[Kittredge 1982] Kittredge, R. "Variation and homogeneity of sublanguages." In Sublanguage: studies of language in restricted semantic domains, ed. R. Kittredge and J. Lehrberger. Berlin & New York: Walter de Gruyter; 1982.
[Lehrberger 1982] Lehrberger, J. "Automatic translation and the concept of sublanguage." In Sublanguage: studies of language in restricted semantic domains, ed. R. Kittredge and J. Lehrberger. Berlin & New York: Walter de Gruyter; 1982.
[Marsh 1983] Marsh, E. "Utilizing domain-specific information for processing compact text." Proc. Conf. Applied Natural Language Processing, 99-103, Assn. for Computational Linguistics, 1983.
[Nagao 1982] Nagao, M.; Nakamura, J. "A parser which learns the application order of rewriting rules." Proc. COLING 82, 253-258.
[Sager 1981] Sager, N. Natural Language Information Processing.
Figure 1. Growth in the size of the grammar as a function of
the size of the text sample. X = the number of sentences
(and sentence fragments) in the text sample; Y = the number
of non-terminal symbols in the context-free component of the
grammar.
Graph A: first set of patient documents
Graph B: second set of patient documents ("discharge summaries")
Graph C: equipment failure messages
[Plots not reproduced. Panel title: SENTENCES VS NON-TERMINAL SYMBOLS.]
Figure 2. Growth in the size of the grammar as a function of
the size of the text sample. X = the number of sentences
(and sentence fragments) in the text sample; Y = the number
of productions in the context-free component of the grammar.
Graph A: first set of patient documents
Graph B: second set of patient documents ("discharge summaries")
Graph C: equipment failure messages ("CASREPs")
[Plots not reproduced. Panel title: SENTENCES VS PRODUCTIONS.]