M.S. Blois, D.D. Sherertz, M.S. Tuttle
Section on Medical Information Science
University of California, San Francisco

Experiments were conducted on a book, Current Medical Information and Terminology (CMIT; AMA, Chicago, 1971, edited by Burgess Gordon, M.D.), which is a compendium of 3262 diseases, each of which is defined by a collection of attributes. The original purpose of the book was to introduce a standard nomenclature of disease names, and the attributes are organized in conventional medical form: a definition consists of a brief description of the relevant symptoms, signs, laboratory findings, and the like. Each disease is, in addition, assigned to one (or at most two) of eleven disease categories which enumerate physiological systems (skin, respiratory, cardiovascular, etc.). While the editorial style of the book is highly telegraphic, with many attributes being expressed as single words, it is nevertheless easily readable (see Figure 1).
The vocabulary employed consists of about 19,000 distinct "words" (determined by a lexical definition), divided roughly equally between common English words and medical terms. We measured word frequency by "disease occurrence" (the number of disease definitions in which a given word occurs one or more times). By this measure, only seven words occurred in more than half the disease definitions, and about 40% of the vocabulary occurred in [...] (Table 1 shows the words at the top of the frequency list together with the number of occurrences.)
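The "disease occurrence" count can be illustrated with a minimal Python sketch. The tokenizing rule (lowercase runs of letters and digits) is an assumption, since the lexical definition of a "word" is not spelled out here, and the toy definitions are invented stand-ins rather than CMIT text.

```python
from collections import Counter
import re

def disease_occurrence(definitions):
    """For each word, count the definitions that contain it at least once."""
    counts = Counter()
    for text in definitions.values():
        # Assumed tokenizer: lowercase runs of letters and digits.
        # The set drops repeats, so a definition adds at most 1 per word.
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        counts.update(words)
    return counts

# Hypothetical toy definitions (not quoted from CMIT).
toy = {
    "disease_a": "fever; chills; fever sustained",
    "disease_b": "murmur; fever",
    "disease_c": "rash on skin",
}
freq = disease_occurrence(toy)
for word, n in freq.most_common(3):
    print(word, n)
```

Counting definitions rather than raw tokens keeps a word that is repeated within one definition from inflating its frequency.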
Assisted by the facilities of the UNIX operating system, we created a series of inverted files (from a magnetic tape of the CMIT text) and developed a set of interactive programs to form a word-and-context query system. This system has enabled us to study the problem of inferring term reference in this large sample of text (some 333,000 word occurrences), within the context of diseases.
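As a rough illustration of an inverted file with a word-and-context query, the following Python sketch indexes each word by definition and position and returns short snippets around each hit. The function names, the tokenizer, and the toy text are all hypothetical; nothing here reproduces the original programs.

```python
from collections import defaultdict
import re

def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_file(definitions):
    """Map each word to a list of (disease_id, position) postings."""
    index = defaultdict(list)
    for disease_id, text in definitions.items():
        for position, word in enumerate(tokenize(text)):
            index[word].append((disease_id, position))
    return index

def word_in_context(index, definitions, word, width=2):
    """Return (disease_id, snippet) pairs with `width` neighbours per side."""
    hits = []
    for disease_id, position in index.get(word.lower(), []):
        tokens = tokenize(definitions[disease_id])
        start = max(0, position - width)
        snippet = " ".join(tokens[start:position + width + 1])
        hits.append((disease_id, snippet))
    return hits

# Hypothetical toy definitions (not quoted from CMIT).
toy = {"d1": "sustained fever with chills", "d2": "systolic murmur at apex"}
index = build_inverted_file(toy)
hits = word_in_context(index, toy, "murmur", width=1)
```

Storing positions as well as definition identifiers is what lets the query show a term in its surrounding context instead of only listing the diseases that mention it.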
An interesting early result was the ease with which many medical terms could be algorithmically separated from common English words. After adjusting for the fact that some disease categories are larger than others, we defined an entropy-like measure of the distribution of word occurrences over the eleven physiological categories as a measure of category specificity. We reasoned that some medical terms, such as 'murmur', while not specific to any particular heart disease, are specific to heart disease generally. This term would not, for example, be used in describing endocrine disorders. Such a word would be expected to occur frequently in category 04 (cardiovascular disease), and not in the other categories. Such a term would, by our measure, have a low 'entropy'. A common English word like 'of' would be used in the descriptions of all kinds of disease, and would accordingly have a high 'entropy'.

Tables 2 and 3 show the top and bottom of the list of all words occurring in two or more diseases, sorted by this entropy measure. In these lists, as our hypothesis seems to imply, low 'entropy' corresponds to high 'specificity', and high 'entropy' to low 'specificity'. This separation of medical terms from common English words by algorithmic means is facilitated by the context supplied by the notion of 'disease category', and by the fact that this was represented in the CMIT text.
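One plausible reading of the entropy-like measure is sketched below in Python. The exact formula is not given in the text, so the size adjustment (dividing a word's count in each category by the number of diseases in that category) and the use of natural logarithms are assumptions; the category sizes and word counts are invented for illustration.

```python
import math

def category_entropy(counts, category_sizes):
    """Entropy (natural log) of a word's size-adjusted category distribution."""
    if len(counts) != len(category_sizes):
        raise ValueError("counts and category_sizes must have equal length")
    # Assumed size adjustment: occurrences per disease within each category.
    rates = [c / n for c, n in zip(counts, category_sizes)]
    total = sum(rates)
    if total == 0:
        raise ValueError("word does not occur in any category")
    probs = [r / total for r in rates if r > 0]
    return sum(p * math.log(1.0 / p) for p in probs)

# Hypothetical sizes for the eleven categories (not taken from CMIT).
sizes = [300, 250, 200, 350, 400, 150, 450, 300, 200, 400, 262]

# A 'murmur'-like word: every occurrence falls in one category.
specific = category_entropy([0, 0, 0, 0, 40, 0, 0, 0, 0, 0, 0], sizes)

# An 'of'-like word: occurs in every disease of every category.
common = category_entropy(sizes, sizes)

print(specific, common, math.log(11))
```

Under these assumptions a word confined to one category scores zero, and a word spread evenly over all eleven categories reaches the upper bound of ln 11, roughly 2.398.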
* This work was supported in part by grants from The
Commonwealth Fund, and from the National Library of
Medicine (1 K10 LM00014)
Our second experiment investigated the co-occurrence properties of some medical terms. Aware that many medical diagnostic programs have assumed attribute independence, we sought to shed light on the appropriateness of the assumption by evaluating it in terms of word co-occurrence in disease definitions.

Since the previously described procedure had given us a means of selecting medical terms from common English words, it was possible to produce lists of 'pure' medical terms. We then wrote a program which formed all pairs of such terms (ignoring order). We defined an 'association measure' (A) which measured the difference between the observed co-occurrences of term-pairs (they could co-occur in any location in the definition and in either order) and the co-occurrences expected from chance alone. Tables 4 and 5 show the top and bottom of a list of all pairs formed from the low-entropy terms in the previous experiment. The first 1120 terms were chosen, that is, those having an entropy of 2.0 napiers or less. The pair list was then sorted by this association measure, A.
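The pair-scoring step can be sketched in Python as follows. The definition of A is not given in the text, so this sketch substitutes the standard phi coefficient, which also compares observed co-occurrence with the count expected by chance; it should not be read as the authors' measure. The term sets and the total number of definitions are invented.

```python
import math
from itertools import combinations

def phi_association(n_both, n_a, n_b, n_total):
    """Phi coefficient of two terms over n_total definitions (stand-in for A)."""
    expected = n_a * n_b / n_total  # co-occurrences expected by chance alone
    denom = math.sqrt(n_a * n_b * (n_total - n_a) * (n_total - n_b))
    if denom == 0:
        return 0.0  # a term in none or all definitions carries no information
    return (n_both - expected) * n_total / denom

def rank_pairs(term_to_diseases, n_total):
    """Score every unordered pair of terms; sort by decreasing association."""
    scored = []
    for a, b in combinations(sorted(term_to_diseases), 2):
        da, db = term_to_diseases[a], term_to_diseases[b]
        score = phi_association(len(da & db), len(da), len(db), n_total)
        scored.append((score, a, b))
    return sorted(scored, reverse=True)

# Hypothetical sets of definition ids containing each term.
terms = {"vena": {1, 2}, "cava": {1, 2}, "bone": {3, 4, 5}}
ranked = rank_pairs(terms, n_total=10)
for score, a, b in ranked:
    print(f"{score:+.3f} {a}-{b}")
```

In this toy data the pair that always appears together lands at the top of the list, and pairs that never share a definition receive negative scores.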
Word pairs which are found to be highly associated appear to be so for two reasons. The first, which is trivial, is that some word pairs are semantically one word despite being lexically two. Common examples would be 'White House' and 'Hong Kong'; medical examples are 'vital capacity', 'axis deviation', and 'slit lamp'. These could have been avoided algorithmically by not taking adjacent words in forming the term-pairs, without any significant overall effect. The second reason for high-frequency word co-occurrence is that both words are causally related through underlying physiological mechanisms. It is these which had the greatest interest for us, and the measure A may be viewed as a measure of the non-independence of the symptoms or signs themselves.

The term pairs which are negatively associated have this property for the same reason. If the two terms are used typically in the descriptions of different diseases, they are less likely to co-occur than by chance. (In a baseball story on the sports page, we would not find 'pass', 'punt', or 'tackle'.) These negatively associated pairs may have value in diagnostic programs for the recognition of two or more diseases in a given patient, a problem not satisfactorily dealt with by even the most sophisticated of current programs.
Finally, an extension of the entropy concept permits one to generate (algorithmically) the vocabularies used by the medical specialties (which correspond to the disease categories represented in CMIT). This is done by assigning terms which occur predominantly in one category to a single vocabulary and then sorting by entropy. Tables 6 and 7 show the vocabularies used in dermatology and gastroenterology (as derived from CMIT). These vocabularies, it will be noted, can be used as 'hit lists' for the purpose of recognizing the content of medical texts.
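The vocabulary-generation step might look like the following Python sketch. The cutoff for "predominantly" (here, at least half of a word's size-adjusted occurrences in one category) is an assumption, as are the function name, the three-category toy data, and the entropy values.

```python
def specialty_vocabularies(word_stats, threshold=0.5):
    """Group words by dominant category; sort each group by entropy.

    word_stats maps word -> (entropy, per-category proportions).
    """
    groups = {}
    for word, (entropy, proportions) in word_stats.items():
        top = max(range(len(proportions)), key=proportions.__getitem__)
        # Assumed rule: 'predominantly' means at least `threshold` of the mass.
        if proportions[top] >= threshold:
            groups.setdefault(top, []).append((entropy, word))
    # Lowest entropy (most category-specific) words come first.
    return {cat: [w for _, w in sorted(items)] for cat, items in groups.items()}

# Hypothetical statistics over three categories (index 1 plays 'skin').
stats = {
    "papules":  (0.20, [0.00, 0.95, 0.05]),
    "pruritus": (0.80, [0.10, 0.70, 0.20]),
    "murmur":   (0.10, [0.02, 0.00, 0.98]),
    "of":       (1.09, [0.34, 0.33, 0.33]),
}
vocab = specialty_vocabularies(stats)
print(vocab)
```

A word such as 'of', which has no dominant category, is left out of every vocabulary, so the resulting lists contain only category-specific terms.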
In summary, we see the ability to differentiate medical terms from common words by context, and the ability to relate the medical words by meaning, as two of the first steps toward text processing algorithms that preserve and can manipulate the semantic content of words in medical texts.
AT  FEVER, MOUNTAIN; FEVER, MOUNTAIN TICK
ET  VIRUS TRANSMITTED BY TICK DERMACENTOR ANDERSONI
SM  CHILLS; HEADACHE; PHOTOPHOBIA; BACKACHE; PAIN IN EYE;
    MYALGIA; ANOREXIA; NAUSEA; VOMITING; PROSTRATION
SG  SEASONAL, MARCH TO JULY, IN WESTERN UNITED STATES;
    INCUBATION PERIOD 4-6 DAYS; ONSET ABRUPT; POSSIBLY SLIGHT
    ERYTHEMA; SUSTAINED FEVER, 102-104 F OR HIGHER SIGNIFICANT;
    PULSE RATE INCREASED COURSE: IN PREVENTION, REMOVAL OF TICK
    FROM SKIN; APPLICATIONS TO SKIN OF TURPENTINE, IODINE,
    ACETONE; REMOVAL OF TICK BY INSERTION OF NEEDLE BETWEEN
    MOUTH PARTS; ASPIRIN FOR PAIN; ANTIBIOTIC TREATMENT
    INEFFECTIVE
CM  ENCEPHALITIS, MENINGITIS ESPECIALLY IN CHILDREN
LB  WBC DECREASED; MONOCYTOSIS; COMPLEMENT-FIXATION TEST
    POSITIVE; INJECTION OF SERUM OR CSF KILLING SUCKLING MICE;
    NEUTRALIZATION OF VIRUS WITH IMMUNE SERUM RESULTING IN
    SURVIVAL

Figure 1. Typical disease 'definition' taken from CMIT.
possibly 489 severe 368 years
usually 443 chronic 340 weakness
on 434 treatment 337 inflammation
infection 431 later 336 age
features 426 absent 338 within
at 421 asymptomatic 331 lower
associated 415 rarely 327 necrosis
increased 414 hereditary 325 positive
blood 398 abdominal 316 whe
fever 75 involvement 381 congenital
loss 369 especially 361 enlargement
Table 1. The highest frequency words used in CMIT, together with the number of disease definitions in which the word occurs at least once.
[Table 2 body is not recoverable from the source. Legible entries in the word column include bronchoscopy, cataract, urethral, urethra, cervix, intraocular, pyuria, ventricle, adenoma, splenectomy, valve, and pneumothorax.]
Table 2. The lowest 'entropy' words in CMIT, in order of increasing 'entropy'. The entropy is given in the first column; the entries in the next 11 columns are the percent of occurrences in the 11 disease categories (body as a whole, skin, musculo-skeletal, respiratory, cardiovascular, hemic and lymphatic, GI, GU, endocrine, nervous, organs of special sense).
[Table 3 body is not recoverable from the source. Legible first-column entropy values run from about 2.36 to 2.40. Legible entries in the word column, with trailing counts as printed, include common 422, absence 447, severe 489, cases 260, usually 1379, of 3206, from 989, and after 538; further legible words are marked, simple, late, most, and, general, as, early, and by.]

Table 3. The highest 'entropy' words in CMIT. Note that these are common English words.

[Table 4 body is not recoverable from the source. Legible association values run from about 0.95 down to about 0.87. Legible term pairs include vena-cava, saline-catharsis, nasal-rhinoscopy, inhalation-percutaneous, and air-ppm.]

Table 4. The top of the word-pair list in decreasing order of association value (A).
[Table 5 body is not recoverable from the source. The two legible association values are -0.0974 and -0.0891. Nearly all legible pairs combine 'bone' or 'dyspnea' with a second term, for example bone-ventricular, bone-qrs, bone-atrial, bone-urethral, bone-ovary, bone-leads, bone-exertional, bone-gallbladder, bone-dysarthria, bone-abortion, dyspnea-nerves, dyspnea-scalp, dyspnea-urethral, dyspnea-gait, dyspnea-cystoscopy, and dyspnea-penis.]

Table 5. The bottom of the word-pair list, showing the negatively correlating words.
[Table 6 body is only partly recoverable from the source. Legible first-column entropy values: 1.9926, 2.0008, 2.0032. Legible word entries, each with its trailing count as printed: papules 76; acanthosis 44; hyperkeratosis 56; macules 11; involution 22; horny 21; stratum 21; pruritus 185; soles 40; crust 16; keratosis 17; circumscribed 65; crusting 27; plaques 57; sunlight 25; verrucous 14; scaly 22; ridges 25; hyperkeratotic 17. Words whose counts are illegible: sebaceous, keratin, itches, leaving, nail.]

Table 6. A word list generated algorithmically which constitutes a dermatological vocabulary. The disease category 'skin' is represented by the third column.
Table 7. A word list generated algorithmically which constitutes a vocabulary of gastroenterology. The eighth column represents the disease category 'digestive system'.