American Sign Language Generation:
Multimodal NLG with Multiple Linguistic Channels
Matt Huenerfauth
Computer and Information Science University of Pennsylvania Philadelphia, PA 19104 USA
matt@huenerfauth.com
Abstract
Software to translate English text into American Sign Language (ASL) animation can improve information accessibility for the majority of deaf adults with limited English literacy. ASL natural language generation (NLG) is a special form of multimodal NLG that uses multiple linguistic output channels. ASL NLG technology has applications for the generation of gesture animation and other communication signals that are not easily encoded as text strings.
American Sign Language (ASL) is a full natural language – with a linguistic structure distinct from English – used as the primary means of communication for approximately one half million deaf people in the United States (Neidle et al., 2000; Liddell, 2003; Mitchell, 2004). Without aural exposure to English during childhood, a majority of deaf U.S. high school graduates (age 18) have only a fourth-grade (age 10) English reading level (Holt, 1991). Technology for the deaf rarely addresses this literacy issue; so, many deaf people find it difficult to read text on electronic devices. Software for translating English text into animations of a computer-generated character performing ASL can make a variety of English text sources accessible to the deaf, including: TV closed captioning, teletype telephones, and computer user-interfaces (Huenerfauth, 2005). Machine translation (MT) can also be used in educational software for deaf children to help them improve their English literacy skills.
This paper describes the design of our English-to-ASL MT system (Huenerfauth, 2004a, 2004b, 2005), focusing on ASL generation. This overview illustrates important correspondences between the problem of ASL natural language generation (NLG) and related research in Multimodal NLG.
1.1 ASL Linguistic Issues
In ASL, several parts of the body convey meaning in parallel: hands (location, orientation, shape), eye gaze, mouth shape, facial expression, head-tilt, and shoulder-tilt. Signers may also interleave lexical signing (LS) with classifier predicates (CP) during a performance. During LS, a signer builds ASL sentences by syntactically combining ASL lexical items (arranging individual signs into sentences). The signer may also associate entities under discussion with locations in space around their body; these locations are used in pronominal reference (pointing to a location) or verb agreement (aiming the motion path of a verb sign to/from a location). During CPs, signers' hands draw a 3D scene in the space in front of their torso. One could imagine invisible placeholders floating in front of a signer representing real-world objects in a scene.

To represent each object, the signer places his/her hand in a special handshape (used specifically for objects of that semantic type: moving vehicles, seated animals, upright humans, etc.). The hand is moved to show a 3D location, movement path, or surface contour of the object being described. For example, to convey the English sentence "the car parked next to the house," signers would indicate a location in space to represent the house using a special handshape for 'bulky objects.' Next, they would use a 'moving vehicle' handshape to trace a 3D path for the car which stops next to the house.
1.2 Previous ASL MT Systems
There have been some previous English-to-ASL MT projects – see the survey in (Huenerfauth, 2003). Amid other limitations, none of these systems address how to produce the 3D locations and motion paths needed for CPs. A fluent, useful English-to-ASL MT system cannot ignore CPs: ASL sign-frequency studies show that signers produce a CP from 1 to 17 times per minute, depending on genre (Morford and MacFarlane, 2003). Further, it is those English sentences whose ASL translation uses a CP that a deaf user with low English literacy would need an MT system to translate. These English sentences look structurally different than their ASL CP counterpart – often making the English sentence difficult to read for a deaf user.
NLG researchers think of communication signals in a variety of ways: some as a written text, others as speech audio (with prosody, timing, volume, and intonation), and those working in Multimodal NLG as text/speech with coordinated graphics (maps, charts, diagrams, etc.). Some Multimodal NLG focuses on "embodied conversational agents" (ECAs), computer-generated animated characters that communicate with users using speech, eye gaze, facial expression, body posture, and gestures (Cassell et al., 2000; Kopp et al., 2004).
The output of any NLG system could be represented as a stream of values (or features) that change over time during a communication signal; some NLG systems specify more values than others. Because the English writing system does not record a speaker's prosody, facial expression or gesture¹, a text-based NLG system specifies fewer communication stream values in its output than does a speech-based or ECA system. A text-based NLG system requires literate users, to whom it can transfer some of the processing burden; they must mentally reconstruct more of the language performance than do users of speech or ECA systems. Since most writing systems are based on strings, text-based NLG systems can easily encode their output as a single stream, namely a sequence of words/characters.
¹ Some punctuation marks loosely correspond to intonation or pauses, but most prosodic information is lost. Facial expression and gesture are generally not conveyed in writing, except perhaps for the occasional use of "emoticons." ;-)
To generate more complex signals, multimodal systems decompose their output into several sub-streams – we'll refer to these as "channels." Dividing a communication signal into channels can make it easier to represent the various choices the generator must make; generally, a different processing component of the system will govern the output of each channel. The trade-off is that these channels must be coordinated over time. Instead of thinking of channels as dividing a communication signal, we can think of them as groupings of individual values in the data stream that are related in some way. The channels of a multimodal NLG system generally correspond to natural perceptual/conceptual groupings called "modalities." Coarsely, audio and visual parts of the output are thought of as separate modalities. When parts of the output appear on different portions of the display, then they are also generally considered separate modalities. For instance, a multimodal NLG system for automobile driving directions may have separate processing channels for text, maps, other graphics, and sound effects. An ECA system may have separate channels for eye gaze, facial expression, manual gestures, and speech audio of the animated character.

When a language has no commonly-known writing system – as is the case for ASL – then it's not possible to build a text-based NLG system. We must produce an animation of a character (like an ECA) performing ASL; so, we must specify how the hands, eye gaze, mouth shape, facial expression, head-tilt, and shoulder-tilt are coordinated over time. With no conventional string-encoding of ASL (that would compress the signal into a single stream), an ASL signal is spread over multiple channels of the output – a departure from most Multimodal NLG systems, which have a single linguistic channel/modality that is coordinated with other non-linguistic resources (Figure 1).
[Figure 1: Linguistic Channels in Multimodal Systems – a prototypical driving-direction system has a single linguistic channel (English text) plus non-linguistic channels for driving maps, other graphics, and sound effects; a prototypical ASL system has multiple linguistic channels: right hand, left hand, head-tilt, eye-gaze, and facial expression.]
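To make the channel idea concrete, here is a minimal sketch in Python (a language the paper itself does not use; the class names, channel labels, and event values are illustrative assumptions, not the system's actual data structures) of an output specification in which each articulator carries its own stream of time-stamped events:

    # Sketch only: a multichannel "performance" is the union of per-channel,
    # time-stamped events; a text-based system would need just one channel.
    from dataclasses import dataclass, field

    @dataclass
    class ChannelEvent:
        channel: str      # e.g. "right-hand", "eye-gaze", "head-tilt"
        start: float      # seconds from the start of the performance
        end: float
        value: str        # symbolic description of what the articulator does

    @dataclass
    class PerformanceSpec:
        events: list = field(default_factory=list)

        def add(self, event: ChannelEvent) -> None:
            self.events.append(event)

        def channel(self, name: str) -> list:
            """Project the multichannel signal onto a single channel."""
            return sorted((e for e in self.events if e.channel == name),
                          key=lambda e: e.start)

    spec = PerformanceSpec()
    spec.add(ChannelEvent("right-hand", 0.0, 0.8, "vehicle classifier: trace path left-to-right"))
    spec.add(ChannelEvent("eye-gaze",   0.0, 0.8, "follow the right hand"))
    spec.add(ChannelEvent("head-tilt",  0.0, 0.8, "slight tilt toward the path"))
    print([e.value for e in spec.channel("eye-gaze")])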
Of course, we could invent a string-based notation for ASL so that we could use traditional text-based NLG technology. (Since ASL has no writing system, we would have to invent an artificial notation.) Unfortunately, since the users of the system wouldn't be trained in this new writing system, it could not be used as output; we would still need to generate a multimodal animation output. An artificial writing system could only be used for internal representation and processing. However, flattening a naturally multichannel signal into a single-channel string (prior to generating a multichannel output) can introduce its own complications to the ASL system's design. For this reason, this project has been exploring ways to represent the hierarchical linguistic structure of information on multiple channels of ASL performance (and how these structures are coordinated or uncoordinated across channels over time).
Some multimodal systems have explored using linguistic structures to control (to some degree) the output of multiple channels. Research on generating animations of a speaking ECA character that performs meaningful gestures (Kopp et al., 2004) has similarities to this ASL project. First of all, the channels in the signal are basically the same; an animated human-like character is shown onscreen with information about eye, face, and arm movements being generated. However, an ASL system has no audio speech channel but potentially more fine-grained channels of detailed body movement.

The less superficial similarity is that (Kopp et al., 2004) have attempted to represent the semantic meaning of some of the character's gestures and to synchronize them with the speech output. This means that, like an ASL NLG system, several channels of the signal are being governed by the linguistic mechanisms of a natural language.

Unlike ASL, the gesture system uses the speech audio channel to convey nearly all of the meaning to the user; the other channels are generally used to convey additional/redundant information. Further, the internal structure of the gestures is not generally encoded in the system; they are typically atomic/lexical gesture events which are synchronized to co-occur with portions of speech output. A final difference is that gestures which co-occur with English speech (although meaningful) can be somewhat vague and are certainly less systematic and conventional than ASL body movements. So, while both systems may have multiple linguistic channels, the gesture system still has one primary linguistic channel (audio speech) and a few channels controlled in only a partially linguistic way.
The linguistic and multimodal issues discussed above have had important consequences on the design of our English-to-ASL MT system. There are several unique features of this system caused by: (1) ASL having multiple linguistic channels that must be coordinated during generation, (2) ASL having both an LS and a CP form of signing, (3) CP signing visually conveying 3D spatial relationships in front of the signer's torso, and (4) ASL lacking a conventional written form. While ASL-particular factors influenced this design, section 5 will discuss how this design has implications for NLG of traditional written/spoken languages.
3.1 Coordinating Linguistic Channels
Section 2 mentioned that this project is developing multichannel (non-string) encodings of ASL animation; these encodings must coordinate multiple channels of the signal as they are generated by the linguistic structures and rules of ASL. Kopp et al. (2004) have explored how to coordinate meaningful gestures with the speech signal during generation; however, their domain is somewhat simpler: their gestures are atomic events without internal hierarchical structure. Our project is currently developing grammar-like coordination formalisms that allow complex linguistic signals on multiple channels to be conveniently represented.²
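Since the coordination formalism itself is left to a future publication (footnote 2), the following is only a hedged sketch of the kind of constraint such a formalism must be able to express – that the events on several channels dominated by one linguistic unit begin and end together; the names and the tolerance value are illustrative assumptions:

    # Sketch: a classifier predicate typically requires hand, eye-gaze, and
    # head-tilt events that are co-extensive in time.
    from dataclasses import dataclass

    @dataclass
    class TimedEvent:
        channel: str
        start: float
        end: float
        label: str

    def coordinated(events, tolerance=0.05):
        """True if all events begin and end together (within a small tolerance)."""
        starts = [e.start for e in events]
        ends = [e.end for e in events]
        return (max(starts) - min(starts) <= tolerance and
                max(ends) - min(ends) <= tolerance)

    cp = [TimedEvent("right-hand", 0.0, 0.8, "vehicle CP"),
          TimedEvent("eye-gaze",   0.0, 0.8, "track the dominant hand"),
          TimedEvent("head-tilt",  0.0, 0.8, "tilt toward the path")]
    assert coordinated(cp)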
3.2 ASL Computational Linguistic Models
This project uses representations of discourse, semantics, syntax, and (sign) phonology tailored to ASL generation (Huenerfauth, 2004b). In particular, since this MT system will generate animations of classifier predicates (CPs), the system consults a 3D model of real-world scenes under discussion. Further, since multimodal NLG requires a form of scheduling (events on multiple channels are coordinated over a performance timeline), all of the linguistic models consulted and modified during ASL generation are time-indexed according to a timeline of the ASL performance being produced.
² Details of this work will be described in a future publication.
Previous ASL phonological models were designed to represent non-CP ASL, but CPs use a reduced set of handshapes, standard eye-gaze and head-tilt patterns, and more complex orientations and motion paths. The phonological model developed for this system makes it easier to specify CPs.
Because ASL signers can use the space in front of their body to visually convey information, it is possible during CPs to show the exact 3D layout of objects being discussed. (The use of channels representing the hands means that we can now indicate 3D visual information – not possible with speech or text.) To represent this detailed 3D form of meaning, this system has an unusual semantic model for generating CPs. We populate the volume of space around the signer's torso with invisible 3D objects representing entities discussed by the CPs being generated (Huenerfauth, 2004b). The semantic model is the set of placeholders around the signer (augmented with the CP handshape used for each). Thus, the semantics of the "car parked next to the house" example (section 1.1) is that a 'bulky' object occupies a particular 3D location and a 'vehicle' object moves toward it and stops.

Of course, the system will also need more traditional semantic representations of the information to be conveyed during generation, but this 3D model helps the system select the proper 3D motion paths for the signers' hands when "drawing" the 3D scenes during CPs. The work of (Kopp et al., 2004) studies gestures that convey spatial information during an English speech performance, but unlike this system, they use a logical-predicate-based semantics to represent information about objects referred to by gesture. Because ASL CPs indicate 3D layout in a linguistically conventional and detailed way, we use an actual 3D model of the objects being discussed. Such a 3D model may also be useful for ECA systems that wish to generate more detailed 3D spatial gesture animations.
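As an illustration of this placeholder-based semantic model, a minimal sketch follows; the coordinate convention, field names, and handshape labels are assumptions for illustration, not the system's actual representation:

    # Sketch: the space in front of the signer is populated with invisible
    # objects, each tagged with the classifier handshape used to depict it.
    from dataclasses import dataclass, field
    from typing import List, Tuple

    Point = Tuple[float, float, float]   # (x, y, z) in the signing space

    @dataclass
    class Placeholder:
        entity: str                      # the discourse entity it stands for
        handshape: str                   # e.g. "bulky-object", "moving-vehicle"
        path: List[Point] = field(default_factory=list)   # a single point = static location

    # "The car parked next to the house":
    house = Placeholder("house", "bulky-object", [(0.3, 0.0, 0.4)])
    car = Placeholder("car", "moving-vehicle",
                      [(-0.4, 0.0, 0.4), (0.0, 0.0, 0.4), (0.2, 0.0, 0.4)])
    scene = [house, car]
    print(car.path[-1])    # the hand's final location: next to the 'house'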
The discourse model in this ASL system records features not found in other NLG systems. It tracks whether a 3D location has been assigned to each discourse entity, where that location is around the signer, and whether the latest location of the entity has been indicated by a CP. The discourse model is not only relevant during CP performance; since ASL LS performance also assigns 3D locations to objects under discussion (for pronouns and verbal agreement), this model is also used for LS.
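A hedged sketch of the discourse-model features just described, with illustrative field names:

    # Sketch: each discourse entity records whether it has a 3D locus, where
    # that locus is, and whether a CP has already shown its latest location.
    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass
    class DiscourseEntity:
        name: str
        locus_assigned: bool = False                          # has a 3D location been assigned?
        locus: Optional[Tuple[float, float, float]] = None    # where is it around the signer?
        location_shown_by_cp: bool = False                    # has a CP indicated its latest location?

    car = DiscourseEntity("car")
    car.locus_assigned = True
    car.locus = (0.2, 0.0, 0.4)
    car.location_shown_by_cp = True
    # The same locus can later be reused during LS for pronouns or verb agreement.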
3.3 Generating 3D Classifier Predicates
An essential step in producing an animation of an ASL CP is the selection of 3D motion paths for the computer-generated signer's hands, eye gaze, and head tilt. The motion paths of objects in the 3D model described above are used to select corresponding motion paths for these parts of the signer's body during CPs. To build the 3D placeholder model, this system uses preexisting scene-visualization software to analyze an English text describing the motion of real-world objects and build a 3D model of how the objects mentioned in the text are arranged and move (Huenerfauth, 2004b). This model is "overlaid" onto the volume in front of the ASL signer (Figure 2). For each object in the scene, a corresponding invisible placeholder is positioned in front of the signer; the layout of placeholders mimics the layout of objects in the 3D scene. In the "car parked next to the house" example, a miniature invisible object representing a 'house' is positioned in front of the signer's torso, and another object (with a motion path terminating next to the 'house') is added to represent the 'car.' The locations and orientations of the placeholders are later used by the system to select the locations and orientations for the signer's hands while performing CPs about them. So, the motion path calculated for the car will be the basis for the 3D motion path of the signer's hand during the classifier predicate describing the car's motion. Given the information in the discourse/semantic models, the system generates the hand motions, head-tilt, and eye-gaze for a CP. It stores a library containing templates representing a prototypical form of each CP the system can produce.
[Figure 2: Converting English Text to 3D Placeholder – visualization software converts the English text "THE CAR PARKED NEXT TO THE HOUSE." into a 3D model, which is overlaid in front of the ASL signer and converted into 3D placeholder locations/paths.]
The templates are planning operators (with logical pre-conditions, monitored termination conditions, and effects), allowing the system to "trigger" other elements of ASL signing performance that may be required during a CP. A planning-based NLG approach, described in (Huenerfauth, 2004b), is used to select a template, fill in its missing parameters, and build a schedule of the animation events on multiple channels needed to produce a sequence of CPs.
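As a rough illustration of a CP template treated as a planning operator (preconditions, effects, and an expansion into a multichannel schedule), here is a sketch; the operator contents, timing rule, and channel names are assumptions, not the system's actual template library:

    # Sketch: one template for "vehicle moves along a path", expanded into
    # scheduled events on several channels.
    def vehicle_path_cp(entity, path, start):
        """Expand one 'vehicle moves along a path' template into channel events."""
        duration = 0.3 * max(1, len(path) - 1)   # illustrative timing rule
        end = start + duration
        return [
            {"channel": "right-hand", "start": start, "end": end,
             "value": ("moving-vehicle handshape for " + entity, path)},
            {"channel": "eye-gaze", "start": start, "end": end,
             "value": "track the dominant hand"},
            {"channel": "head-tilt", "start": start, "end": end,
             "value": "tilt toward the path of " + entity},
        ]

    TEMPLATE = {
        "name": "vehicle-path-CP",
        "preconditions": ["entity has a placeholder", "placeholder has a motion path"],
        "effects": ["entity's latest location has been shown by a CP"],
        "expand": vehicle_path_cp,
    }

    schedule = TEMPLATE["expand"]("car", [(-0.4, 0.0, 0.4), (0.2, 0.0, 0.4)], 0.0)
    for event in schedule:
        print(event["channel"], event["start"], event["end"])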
3.4 A Multi-Path Architecture
A multimodal NLG system may have several presentation styles it could use to convey information to its user; these styles may take advantage of the various output channels to different degrees. In ASL, there are multiple channels in the linguistic portion of the signal, and not surprisingly, the language has multiple sub-systems of signing that take advantage of the visual modality in different ways. ASL signers can select whether to convey information using lexical signing (LS) or classifier predicates (CPs) during an ASL performance (section 1.1). These two sub-systems use the space around the signer differently; during CPs, locations in space associated with objects under discussion must be laid out in a 3D manner corresponding to the topological layout of the real-world scene under discussion. Locations associated with objects during LS (used for pronouns and verb agreement) have no topological requirement. The layout of the 3D locations during LS may be arbitrary.
The CP generation approach in section 3.3 is computationally expensive; so, we would only like to use this processing pathway when necessary. English input sentences not producing classifier predicates would not need to be processed by the visualization software; in fact, most of these sentences could be handled using the more traditional MT technologies of previous systems. For this reason, our English-to-ASL MT system has multiple processing pathways (Huenerfauth, 2004a). The pathway for handling English input sentences that produce CPs includes the scene-visualization software, while other input sentences undergo less sophisticated processing using a traditional MT approach (that is easier to implement). In this way, our CP generation component can actually be layered on top of a pre-existing English-to-ASL MT system to give it the ability to produce CPs. This multi-path design is equally applicable to the architecture of written-language MT systems. The design allows an MT system to combine a resource-intensive deep-processing MT method for difficult (or important) inputs and a resource-light broad-coverage MT method for other inputs.
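The multi-path routing decision can be sketched as follows; the keyword test used here to decide whether a sentence needs the CP pathway is a deliberately crude stand-in for the system's actual analysis:

    # Sketch: route only sentences that call for classifier predicates through
    # the expensive visualization/CP pathway; send the rest to a lighter,
    # more traditional MT pathway.
    SPATIAL_VERBS = {"park", "parked", "drive", "drove", "walk", "sit", "fly", "move"}

    def needs_classifier_predicate(sentence):
        # Crude stand-in test; the real system would use deeper analysis.
        return any(word.strip(".,").lower() in SPATIAL_VERBS for word in sentence.split())

    def translate(sentence):
        if needs_classifier_predicate(sentence):
            return "CP pathway: visualize the scene, then plan classifier predicates"
        return "LS pathway: traditional MT producing lexical signing"

    print(translate("The car parked next to the house."))
    print(translate("The meeting is on Tuesday."))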
3.5 Evaluation of Multichannel NLG
The lack of an ASL writing system and the multichannel nature of ASL can make NLG or MT systems which produce ASL animation output difficult to evaluate using traditional automatic techniques. Many such approaches compare a string produced by a system to some human-produced 'gold-standard' string. While we could invent an artificial ASL writing system for the system to produce as output, it's not clear that human ASL signers could accurately or consistently produce written forms of ASL sentences to serve as 'gold standards' for such an evaluation. And of course, real users of the system would never be shown artificial "written ASL"; they would see full animations instead. User-based studies (where ASL signers evaluate animation output directly) may be a more meaningful measure of an ASL system.

We are planning such an evaluation of a prototype CP-generation module of the system during the summer/fall of 2005. Members of the deaf community who are native ASL signers will view animations of classifier predicates produced by the system. As a control, they will also be shown animations of CPs produced using 3D motion-capture technology to digitally record the performance of CPs by other native ASL signers. Their evaluation of animations from both sources will be compared to measure the system's performance. The multichannel nature of the signal also makes other interesting experiments possible. To study the system's ability to animate the signer's hands only, motion-captured ASL could be used to animate the head/body of the animated character, and the NLG system can be used to control only the hands of the character. Thus, channels of the NLG system can be isolated for evaluation – an experimental design only available to a multichannel NLG system.
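The channel-isolation experiment could be set up along these lines (a sketch only; the event dictionaries stand in for whatever animation representation the system and the motion-capture pipeline actually share):

    # Sketch: drive most channels from motion-capture data and substitute the
    # NLG system's output on just the channels under study (here, the hands).
    def composite(mocap_events, generated_events, channels_under_test):
        """Keep motion-capture events except on the channels being evaluated."""
        kept = [e for e in mocap_events if e["channel"] not in channels_under_test]
        swapped = [e for e in generated_events if e["channel"] in channels_under_test]
        return kept + swapped

    mocap = [{"channel": "head-tilt", "value": "recorded"},
             {"channel": "right-hand", "value": "recorded"}]
    generated = [{"channel": "right-hand", "value": "generated"},
                 {"channel": "eye-gaze", "value": "generated"}]
    print(composite(mocap, generated, {"right-hand", "left-hand"}))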
The design portion of this English-to-ASL project is nearly complete, and the implementation of the system is ongoing. Evaluations of the system will be available after the user-based study discussed above; however, the design itself has highlighted interesting issues about the requirements of NLG software for sign languages like ASL.
The multichannel nature of ASL has led this project to study mechanisms for coordinating the values of the linguistic models used during generation (including the output animation specification itself). The need to handle both the LS and CP subsystems of the language has motivated: a multi-path MT architecture, a discourse model that stores data relevant to both subsystems, a model of the space around the signer capable of storing both LS and CP placeholders, and a phonological model whose values can be specified by either subsystem.

Since this English-to-ASL MT system is the first to address ASL classifier predicates, designing an NLG process capable of producing the 3D locations and paths in a CP animation has been a major design focus for this project. These issues have been addressed by the system's use of a 3D model of placeholders produced by scene-visualization software and a planning-based NLG process operating on templates of prototypical CP performance.
Sign language NLG requires 3D spatial representations and multichannel coordinated output, but it's not unique in this requirement. In fact, generation of a communication signal for any language may require these capabilities (even for spoken languages like English). We have mentioned throughout this paper how gesture/speech ECA researchers may be interested in NLG technologies for ASL – especially if they wish to produce gestures that are more linguistically conventional, internally complex, or 3D-topologically precise.

Many other computational linguistic applications could benefit from an NLG design with multiple linguistic channels (and indirectly benefit from ASL NLG technology). For instance, NLG systems producing speech output could encode prosody, timing, volume, intonation, or other vocal data as multiple linguistically-determined channels of the output (in addition to a channel for the string of words being generated). And so, ASL NLG research not only has exciting accessibility benefits for deaf users, but it also serves as a research vehicle for NLG technology to produce a variety of richer-than-text linguistic communication signals.
Acknowledgments
I would like to thank my advisors Mitch Marcus and Martha Palmer for their guidance, discussion, and revisions during the preparation of this work.
References
Cassell, J., Sullivan, J., Prevost, S., and Churchill, E. (Eds.). 2000. Embodied Conversational Agents. Cambridge, MA: MIT Press.

Holt, J. 1991. Demographic, Stanford Achievement Test - 8th Edition for Deaf and Hard of Hearing Students: Reading Comprehension Subgroup Results.

Huenerfauth, M. 2003. Survey and Critique of ASL Natural Language Generation and Machine Translation Systems. Technical Report MS-CIS-03-32, Computer and Information Science, University of Pennsylvania.

Huenerfauth, M. 2004a. A Multi-Path Architecture for Machine Translation of English Text into American Sign Language Animation. In Proceedings of the Student Workshop of the Human Language Technologies conference / North American chapter of the Association for Computational Linguistics annual meeting: HLT/NAACL 2004, Boston, MA, USA.

Huenerfauth, M. 2004b. Spatial and Planning Models of ASL Classifier Predicates for Machine Translation. In 10th International Conference on Theoretical and Methodological Issues in Machine Translation: TMI 2004, Baltimore, MD.

Huenerfauth, M. 2005. American Sign Language Spatial Representations for an Accessible User-Interface. In 3rd International Conference on Universal Access in Human-Computer Interaction, Las Vegas, NV, USA.

Kopp, S., Tepper, P., and Cassell, J. 2004. Towards Integrated Microplanning of Language and Iconic Gesture for Multimodal Output. In International Conference on Multimodal Interfaces, State College, PA, USA.

Liddell, S. 2003. Grammar, Gesture, and Meaning in American Sign Language. UK: Cambridge University Press.

Mitchell, R. 2004. How many deaf people are there in the United States? Gallaudet Research Institute, Graduate School & Professional Programs, Gallaudet University. June 28, 2004. http://gri.gallaudet.edu/Demographics/deaf-US.php

Morford, J., and MacFarlane, J. 2003. Frequency Characteristics of ASL. Sign Language Studies, 3:2.

Neidle, C., Kegl, J., MacLaughlin, D., Bahan, B., and Lee, R.G. 2000. The Syntax of American Sign Language: Functional Categories and Hierarchical Structure. Cambridge, MA: The MIT Press.