Clavius: Bi-Directional Parsing for Generic Multimodal Interaction

Frank Rudzicz
Centre for Intelligent Machines, McGill University, Montréal, Canada
frudzi@cim.mcgill.ca

Abstract

We introduce a new multi-threaded parsing algorithm on unification grammars designed specifically for multimodal interaction and noisy environments. By lifting some traditional constraints, namely those related to the ordering of constituents, we overcome several difficulties of other systems in this domain. We also present several criteria used in this model to constrain the search process using dynamically loadable scoring functions. Some early analyses of our implementation are discussed.

1 Introduction

Since the seminal work of Bolt (Bolt, 1980), the methods applied to multimodal interaction (MMI) have diverged towards irreconcilable approaches retrofitted to models not specifically amenable to the problem. For example, the representational differences between neural networks, decision trees, and finite-state machines (Johnston and Bangalore, 2000) have limited the adoption of the results using these models, and the typical reliance on the use of whole unimodal sentences defeats one of the main advantages of MMI: the ability to constrain the search using cross-modal information as early as possible.

CLAVIUS is the result of an effort to combine sensing technologies for several modality types, speech and video-tracked gestures chief among them, within the immersive virtual environment (Boussemart et al., 2004) shown in Figure 1. Its purpose is to comprehend multimodal phrases such as “put this & here &”, where & denotes a pointing gesture, in either command-based or dialogue interaction. CLAVIUS provides a flexible and trainable new bi-directional parsing algorithm on multi-dimensional input spaces, and produces modality-independent semantic interpretation with a low computational cost.

Figure 1: The target immersive environment

1.1 Graphical Models and Unification

Unification grammars on typed directed acyclic graphs have been explored previously in MMI, but typically extend existing mechanisms not designed for multi-dimensional input. For example, both (Holzapfel et al., 2004) and (Johnston, 1998) essentially adapt Earley’s chart parser by representing edges as sets of references to terminal input elements, unifying these as new edges are added to the agenda. In practice, this has led to systems that analyse every possible subset of the input, resulting in a combinatorial explosion that balloons further when considering the complexities of cross-sentential phenomena such as anaphora, and the effects of noise and uncertainty on speech and gesture tracking. We will later show the extent to which CLAVIUS reduces the size of the search space.

Directed graphs conveniently represent both syntactic and semantic structure, and all partial parses in CLAVIUS, including terminal-level input, are represented graphically. Few restrictions apply, except that arcs labelled CAT and TIME must exist to represent the grammar category and time spanned by the parse, respectively (see footnote 1). Similarly, all grammar rules, Γi : LHS → RHS1 RHS2 … RHSr, are graphical structures, as exemplified in Figure 2.

Figure 2: [...] NP click {where(NP :: f1) = (click :: f1)}, with NP expanded by Γ2 : NP → DT NN

1.2 Multimodal Bi-Directional Parsing

Our parsing strategy combines bottom-up and top-down approaches, but differs from other approaches to bi-directional chart parsing (Rocio, 1998) in several key respects, discussed below.

1.2.1 Asynchronous Collaborating Threads

A defining characteristic of our approach is that edges are selected asynchronously by two concurrent processing threads, rather than serially in a two-stage process. In this way, we can distribute processing across multiple machines, or dynamically alter the priorities given to each thread. Generally, this allows for a more dynamic process where no thread can dominate the other. By contrast, in typical bi-directional chart parsing the top-down component is only activated when the bottom-up component has no more legal expansions (Ageno, 2000).

1.2.2 Unordered Constituents

Although evidence suggests that deictic gestures overlap or follow corresponding spoken pronominals 85-93% of the time (Kettebekov et al., 2002), we must allow for all possible permutations of multi-dimensional input, as in “put & this & here” vs. “put this & here &”, for example.

We therefore take the unconventional approach of placing no mandatory ordering constraints on constituents, hence the rule Γabc : A → B C parses the input “C B”. We show how we can easily maintain regular temporal ordering in §3.5.

Footnote 1: Usually this timespan corresponds to the real-time occurrence of a speech or gestural event, but the actual semantics are left to the application designer.

1.2.3 Partial Qualification

Whereas existing bi-directional chart parsers maintain fully-qualified edges by incrementally adding adjacent input words to the agenda, CLAVIUS has the ability to construct parses that instantiate only a subset of their constituents, so Γabc also parses the input “B”, for example. Repercussions are discussed in §3.4 and §4.
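As a rough illustration of unordered constituents and partial qualification, the sketch below treats the right-hand side of Γabc as a multiset of categories. The function name `instantiate` and the multiset representation are assumptions made for this example only; CLAVIUS itself operates on typed graphs, not on lists of category names.

```python
from collections import Counter

def instantiate(rhs, observed):
    """Return the RHS categories still unfilled if `observed` fits the rule, else None.

    Both arguments are sequences of category names; their order is ignored.
    """
    needed = Counter(rhs)
    seen = Counter(observed)
    # Every observed category must be licensed by the rule.
    if seen - needed:
        return None
    # Whatever is left over is still uninstantiated (partial qualification).
    return sorted((needed - seen).elements())

rule_abc = ["B", "C"]                     # hypothetical encoding of A -> B C
full = instantiate(rule_abc, ["C", "B"])  # order of input does not matter
partial = instantiate(rule_abc, ["B"])    # only a subset is instantiated
rejected = instantiate(rule_abc, ["D"])   # category not in the rule
```

Here `full` is an empty list, `partial` lists the one missing constituent, and `rejected` is `None`.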

2 The Algorithm

CLAVIUS expands parses according to a best-first process where newly expanded edges are ordered according to trainable criteria of multimodal language, as discussed in §3. Figure 3 shows a component breakdown of CLAVIUS’s software architecture. The sections that follow explain the flow of information through this system from sensory input to semantic interpretation.

Figure 3: Simplified information flow between fundamental software components

2.1 Lexica and Preprocessing

Each unique input modality is asynchronously monitored by one of T TRACKERS, each sending an n-best list of lexical hypotheses to CLAVIUS for any activity as soon as it is detected. For example, a gesture tracker (see Figure 4a) parametrises the gestures preparation, stroke/point, and retraction (McNeill, 1992), with values reflecting spatial positions and velocities of arm motion, whereas our speech tracker parametrises words with part-of-speech tags and prior probabilities (see Figure 4b). Although preprocessing is reduced to the identification of lexical tokens, this is more involved than simple lexicon lookup due to the modelling of complex signals.

Figure 4: Gestural (a) and spoken (b) ‘words’

2.2 Data Structures

All TRACKERS write their hypotheses directly to the first of three SUBSPACES that partition all partial parses in the search space. The first is the GENERALISER’s subspace, Ξ[G], which is monitored by the GENERALISER thread, the first part of the parser. All new parses are first written to Ξ[G] before being moved to the SPECIFIER’s active and inactive subspaces, Ξ[SAct] and Ξ[SInact], respectively. Subspaces are optimised for common operations by organising parses by their scores and grammatical categories into depth-balanced search trees having the heap property. The best partial parse in each subspace can therefore be found in O(1) amortised time.
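One way to picture a subspace is as a set of per-category heaps ordered by score. The sketch below is a simplification built on Python's `heapq`, not the depth-balanced tree structure used in CLAVIUS; the class name `Subspace` and its methods are hypothetical. Peeking at the top of one category's heap is O(1), while finding the overall best here scans one heap top per category.

```python
import heapq
import itertools

class Subspace:
    """Partial parses grouped by grammatical category, best score first."""

    def __init__(self):
        self._heaps = {}                    # category -> heap of entries
        self._tiebreak = itertools.count()  # keeps heap entries comparable

    def insert(self, category, score, parse):
        # heapq is a min-heap, so the score is negated to surface the best first.
        entry = (-score, next(self._tiebreak), parse)
        heapq.heappush(self._heaps.setdefault(category, []), entry)

    def best(self, category=None):
        """Peek at the highest-scoring parse, optionally within one category."""
        if category is not None:
            heaps = [self._heaps.get(category, [])]
        else:
            heaps = self._heaps.values()
        tops = [h[0] for h in heaps if h]
        if not tops:
            return None
        neg_score, _, parse = min(tops)
        return -neg_score, parse

    def pop_best(self):
        """Remove and return the highest-scoring parse over all categories."""
        live = [h for h in self._heaps.values() if h]
        if not live:
            return None
        neg_score, _, parse = heapq.heappop(min(live, key=lambda h: h[0]))
        return -neg_score, parse
```

A GENERALISER-like consumer would repeatedly call `pop_best()`, whereas a SPECIFIER-like consumer would call `best(category)` to find partners of a given category.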

2.3 Generalisation

The GENERALISER monitors the best partial parse, Ψg, in Ξ[G], and creates new parses Ψi for all grammar rules Γi having CATEGORY(Ψg) on the right-hand side. Effectively, these new parses are instantiations of the relevant Γi, with one constituent unified to Ψg. This provides the impetus towards sentence-level parses, as simplified in Algorithm 1 and exemplified in Figure 5. Naturally, if rule Γi has more than one constituent (c > 1) of type CATEGORY(Ψg), then c new parses are created, each with one of these being instantiated.

Since the GENERALISER is activated as soon as input is added to Ξ[G], the process is interactive (Tomita, 1985), and therefore incorporates the associated benefits of efficiency. This is contrasted with the all-paths bottom-up strategy in GEMINI (Dowding et al., 1993) that finds all admissible edges of the grammar.

Algorithm 1: Simplified Generalisation
Data: Subspace Ξ[G], grammar Γ
while data remains in Ξ[G] do
    Ψg := highest scoring graph in Ξ[G]
    foreach rule Γi s.t. Cat(Ψg) ∈ RHS(Γi) do
        Ψi := Unify(Γi, [• →RHS • ⇒ Ψg])
        if ∃Ψi then
            Apply Score(Ψi) to Ψi
            Insert Ψi into Ξ[G]
    Move Ψg into Ξ[SAct]

Figure 5: Example of GENERALISATION
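Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: graph unification is replaced by simply filling one right-hand-side slot, the `Rule` and `Parse` classes and the `push` helper are hypothetical, and `max_steps` is an added guard that is not part of the algorithm in the paper.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

@dataclass(frozen=True)
class Rule:
    lhs: str
    rhs: tuple                                   # categories on the right-hand side

@dataclass
class Parse:
    cat: str
    rule: Rule = None                            # None for terminal input
    filled: dict = field(default_factory=dict)   # RHS position -> child Parse
    score: float = 0.0

_ids = count()                                   # tie-breaker for equal scores

def push(space, parse):
    """Insert a parse into a subspace kept as a heap, best score first."""
    heapq.heappush(space, (-parse.score, next(_ids), parse))

def generalise(xi_g, xi_s_act, grammar, score, max_steps=100):
    """Simplified generalisation loop over the subspace xi_g."""
    for _ in range(max_steps):
        if not xi_g:
            break
        _, _, psi_g = heapq.heappop(xi_g)        # highest-scoring graph
        for rule in grammar:
            for pos, cat in enumerate(rule.rhs):
                if cat == psi_g.cat:             # one new parse per matching slot
                    psi_i = Parse(rule.lhs, rule, {pos: psi_g})
                    psi_i.score = score(psi_i)
                    push(xi_g, psi_i)
        push(xi_s_act, psi_g)                    # hand over to the SPECIFIER
```

Iterating over every right-hand-side position is what yields c new parses when a rule has c constituents of the same category.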

2.4 Specification

The SPECIFIER thread provides the impetus towards complete coverage of the input, as simplified in Algorithm 2 (see Figure 6). It combines parses in its subspaces that have the same top-level grammar expansion but different instantiated constituents. The resulting parse merges the semantics of the two original graphs only if unification succeeds, providing a hard constraint against the combination of incongruous information. The result, Ψ, of specification must be written to Ξ[G], otherwise Ψ could never appear on the RHS of another partial parse. We show how associated vulnerabilities are overcome in §3.2 and §3.4.

Specification is commutative and will always provide more information than its constituent graphs if it does not fail, unlike the ‘overlay’ method of SMARTKOM (Alexandersson and Becker, 2001), which basically provides a subsumption mechanism over background knowledge.

Algorithm 2: Simplified Specification
Data: Subspaces Ξ[SAct] and Ξ[SInact]
while data remains in Ξ[SAct] do
    Ψs := highest scoring graph in Ξ[SAct]
    Ψj := highest scoring graph in Ξ[SInact] s.t. Cat(Ψj) = Cat(Ψs)
    while ∃Ψj do
        Ψi := Unify(Ψs, Ψj)
        if ∃Ψi then
            Apply Score(Ψi) to Ψi
            Insert Ψi into Ξ[G]
        Ψj := next highest scoring graph from Ξ[SInact] s.t. Cat(Ψj) = Cat(Ψs)
        // Optionally stop after I iterations, for some I
    Move Ψs into Ξ[SInact]

Figure 6: Example of SPECIFICATION
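Algorithm 2 can be sketched along the same lines. In this hedged example, a parse is a plain dictionary, the subspaces are lists of (score, parse) pairs, and `unify` is a toy stand-in for graph unification that merges two parses of the same rule only when they fill different positions. All names are hypothetical.

```python
def unify(psi_s, psi_j):
    """Toy stand-in for graph unification.

    Each parse is a dict {"rule": str, "filled": {position: word}}.
    Returns None (failure) if the rules differ or a position clashes.
    """
    if psi_s["rule"] != psi_j["rule"]:
        return None
    if psi_s["filled"].keys() & psi_j["filled"].keys():
        return None
    return {"rule": psi_s["rule"],
            "filled": {**psi_s["filled"], **psi_j["filled"]}}

def specify(xi_s_act, xi_s_inact, xi_g, score, max_iter=None):
    """Simplified specification loop; max_iter plays the role of I."""
    while xi_s_act:
        xi_s_act.sort(key=lambda entry: entry[0])
        s_score, psi_s = xi_s_act.pop()                 # highest-scoring active parse
        candidates = sorted(xi_s_inact, key=lambda entry: entry[0], reverse=True)
        for _, psi_j in candidates[:max_iter]:          # best inactive partners first
            psi_i = unify(psi_s, psi_j)
            if psi_i is not None:
                xi_g.append((score(psi_i), psi_i))      # results return to the GENERALISER
        xi_s_inact.append((s_score, psi_s))             # psi_s becomes inactive
```

Because the merged dictionary does not depend on argument order, this toy `unify` is commutative, mirroring the property claimed for specification.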

2.5 Cognition

The COGNITION thread monitors the best sentence-level hypothesis, ΨB, in Ξ[SInact], and terminates the search process once ΨB has remained unchallenged by new competing parses for some period of time.

Once found, COGNITION communicates ΨB to the APPLICATION. Both COGNITION and the APPLICATION read state information from the MySQL WORLD database, as discussed in §3.5, though only the latter can modify it.
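The acceptance rule described above can be sketched as a small timer. This is only an illustration: the class name, the `patience` parameter, and the reading of "unchallenged" as "no strictly better sentence-level parse has arrived" are assumptions, and the clock is injectable so that the logic can be exercised without waiting.

```python
import time

class Cognition:
    """Accept the best sentence-level hypothesis once it has gone unchallenged."""

    def __init__(self, patience, clock=time.monotonic):
        self.patience = patience      # seconds the best parse must stay on top
        self.clock = clock
        self.best = None              # (score, parse)
        self.since = None             # time at which `best` took the lead

    def observe(self, score, parse):
        """Report a sentence-level parse; a strictly better one resets the timer."""
        if self.best is None or score > self.best[0]:
            self.best = (score, parse)
            self.since = self.clock()

    def accepted(self):
        """Return the winning parse once the waiting period has elapsed, else None."""
        if self.best is None or self.clock() - self.since < self.patience:
            return None
        return self.best[1]
```

The `patience` value corresponds to the delay between finding and accepting a parse, which is the quantity that trades responsiveness against the risk of accepting too early.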

3 Applying Domain-Centric Knowledge

Upon being created, each partial parse is assigned a score approximating its likelihood of being part of an accepted multimodal sentence. The score of partial parse Ψ,

$$\mathrm{SCORE}(\Psi) = \sum_{i=0}^{|S|} \omega_i \kappa_i(\Psi),$$

is a weighted linear combination of independent scoring modules (KNOWLEDGE SOURCES). Each module presents a score function κi : Ψ → ℝ[0,1] according to a unique criterion of multimodal language, weighted by ωi, also on ℝ[0,1]. Some modules provide ‘hard constraints’ that can outright forbid unification, returning κi = −∞ in those cases. A subset of the criteria we have explored is outlined below.
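The weighted combination can be sketched in a few lines of Python. The function signature is hypothetical, and the explicit early return is a design choice of this sketch rather than something stated in the paper: in IEEE floating-point arithmetic, multiplying −∞ by a weight of 0.0 yields NaN, so a hard constraint with zero weight would otherwise be lost.

```python
import math

def score(parse, modules, weights):
    """Weighted linear combination of knowledge-source scores for one parse."""
    total = 0.0
    for kappa, omega in zip(modules, weights):
        value = kappa(parse)
        if value == -math.inf:
            return -math.inf      # a hard constraint vetoes the parse outright
        total += omega * value
    return total
```

For example, with one soft module returning 0.5 at weight 0.4 and one hard-constraint module at weight 0.0, a permitted parse scores 0.2 and a forbidden one scores −∞.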

3.1 Temporal Alignment (κ1)

By modelling the timespans of parses as Gaussians, where µ and σ are determined by the midpoint and ½ the distance between the two endpoints, respectively, we can promote parses whose constituents are closely related in time with the symmetric Kullback-Leibler divergence,

$$D_{KL}(\Psi_1, \Psi_2) = \frac{(\sigma_1^2 - \sigma_2^2)^2 + \left((\mu_1 - \mu_2)(\sigma_1^2 + \sigma_2^2)\right)^2}{4\sigma_1^2\sigma_2^2}.$$
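As a rough numerical illustration of this timespan model, the sketch below builds (µ, σ) from two endpoints and evaluates a symmetrised divergence. Note the assumption: it uses the standard textbook closed form of KL(p‖q) + KL(q‖p) for two one-dimensional Gaussians, which is not necessarily identical in form or normalisation to the expression used inside CLAVIUS. Function names are hypothetical.

```python
def timespan_gaussian(start, end):
    """Model a timespan as (mu, sigma): midpoint and half the endpoint distance."""
    return (start + end) / 2.0, abs(end - start) / 2.0

def symmetric_kl(g1, g2):
    """Textbook symmetrised KL divergence between two 1-D Gaussians.

    KL(p||q) + KL(q||p); the logarithmic terms cancel in the sum.
    """
    (mu1, s1), (mu2, s2) = g1, g2
    v1, v2 = s1 * s1, s2 * s2
    if v1 == 0.0 or v2 == 0.0:
        raise ValueError("zero-length timespans have no Gaussian density")
    return ((v1 - v2) ** 2 + (mu1 - mu2) ** 2 * (v1 + v2)) / (2.0 * v1 * v2)

speech = timespan_gaussian(1.0, 2.0)    # a spoken word from t=1.0 s to t=2.0 s
gesture = timespan_gaussian(1.0, 3.0)   # a pointing gesture from t=1.0 s to t=3.0 s
divergence = symmetric_kl(speech, gesture)
```

Identical timespans give a divergence of zero, and the value grows as the spans move apart or differ in length, so smaller values indicate constituents that are closer in time.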

Therefore, κ1 promotes more locally-structured parses and co-occurring multimodal utterances.

3.2 Ancestry Constraint (κ2)

A consequence of accepting n-best lexical hypotheses for each word is that we risk unifying parses that include two competing hypotheses. For example, if our speech TRACKER produces hypotheses “horse” and “house” for ambiguous input, then κ2 explicitly prohibits the parse “the horse and the house” with flags on lexical content.

3.3 Probabilistic Grammars (κ3)

We emphasise more common grammatical constructions by augmenting each grammar rule with an associated probability, P(Γi), and assigning

$$\kappa_3(\Psi) = P(\mathrm{RULE}(\Psi)) \cdot \prod_{\Psi_c = \text{constituent of } \Psi} \kappa_3(\Psi_c),$$

where RULE is the top-level expansion of Ψ. Probabilities are trainable by maximum likelihood estimation on annotated data. Within the context of CLAVIUS, κ3 promotes the processing of new input words and shallower parse trees.
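The recursive definition of κ3 can be sketched as below. The dictionary layout of a parse, the rule probabilities, and the base case for terminal input (falling back to a tracker prior, or 1.0 if none is given) are assumptions of this example; the paper does not spell out the base case.

```python
def kappa3(parse, rule_prob):
    """P(RULE(parse)) times kappa3 of every instantiated constituent."""
    if parse.get("rule") is None:
        # Terminal input: assumed to contribute its prior (1.0 if absent).
        return parse.get("prior", 1.0)
    value = rule_prob[parse["rule"]]
    for child in parse.get("children", []):
        value *= kappa3(child, rule_prob)
    return value

# Hypothetical rule probabilities and parses, for illustration only.
probs = {"NP -> DT NN": 0.6, "S -> VB NP": 0.5}
np_parse = {"rule": "NP -> DT NN",
            "children": [{"prior": 0.9}, {"prior": 0.8}]}
s_parse = {"rule": "S -> VB NP",
           "children": [{"prior": 1.0}, np_parse]}
```

Since every rule application multiplies in a probability of at most 1, deeper trees can only score lower than their constituents, which is consistent with κ3 favouring shallower parse trees.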

3.4 Information Content (κ4), Coverage (κ5)

The κ4 module partially orders parses by preferring those that maximise the joint entropy between the semantic variables of its constituent parses. Furthermore, we use a shifted sigmoid,

$$\kappa_5(\Psi) = \frac{2}{1 + e^{-\frac{2}{5}\mathrm{NUMWORDSIN}(\Psi)}} - 1,$$

to promote parses that maximise the number of ‘words’ in a parse. These two modules together are vital in choosing fully specified sentences.
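A shifted sigmoid of this kind can be sketched as follows. Both the numerator of 2, which keeps the score within [0, 1) for non-negative word counts, and the default steepness of 0.4 are assumptions of this sketch and should be checked against the original formula.

```python
import math

def coverage(num_words, steepness=0.4):
    """Shifted sigmoid 2 / (1 + exp(-steepness * n)) - 1 for n >= 0."""
    return 2.0 / (1.0 + math.exp(-steepness * num_words)) - 1.0
```

The score is 0 for an empty parse, rises monotonically with the number of ‘words’ covered, and saturates below 1, so each additional word helps less than the previous one.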

3.5 Functional Constraints (κ6)

Each grammar rule Γi can include constraint functions f : Ψ → ℝ[0,1] parametrised by values in instantiated graphs. For example, the function T_FOLLOWS(Ψ1, Ψ2) returns 1 if constituent Ψ2 follows Ψ1 in time, and −∞ otherwise, thus maintaining ordering constraints. Functions are dynamically loaded and executed during scoring.

Since functions are embedded directly within parse graphs, their return values can be directly incorporated into those parses, allowing us to utilise data in the WORLD. For example, the function OBJECTAT(x, y, &o) determines if an object exists at point (x, y), as determined by a pointing gesture, and writes the type of this object, o, to the graph, which can later further constrain the search.
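The two example constraint functions can be sketched as below. Several details are assumptions: parses are plain dictionaries with a `time` pair, "follows" is read as "starts no earlier than the other ends", the WORLD database is replaced by an in-memory dictionary, and returning −∞ when no object is found is one possible policy rather than documented behaviour.

```python
import math

def t_follows(psi1, psi2):
    """Return 1 if psi2 starts after psi1 ends, and -inf otherwise."""
    return 1.0 if psi2["time"][0] >= psi1["time"][1] else -math.inf

def object_at(world, x, y, graph):
    """Look up the object under a pointing gesture and record its type in the graph."""
    obj = world.get((x, y))
    if obj is None:
        return -math.inf            # assumed policy: nothing there, forbid the parse
    graph["object_type"] = obj      # later unifications can now test this value
    return 1.0

word = {"time": (0.5, 0.9)}         # hypothetical spoken word
click = {"time": (1.2, 1.4)}        # hypothetical pointing gesture
world = {(3, 4): "sphere"}          # stand-in for the WORLD database
graph = {}
found = object_at(world, 3, 4, graph)
```

Attaching `t_follows` to a rule restores conventional ordering for that rule only, while all other rules remain order-free.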

4 Early Results

We have constructed a simple blocks-world experiment where a user can move, colour, create, and delete geometric objects using speech and pointing gestures with 74 grammar rules, 25 grammatical categories, and a 43-word vocabulary. Ten users were recorded interacting with this system, for a combined total of 2.5 hours of speech and gesture data, and 2304 multimodal utterances. Our randomised data collection mechanism was designed to equitably explore the four command types. Test subjects were given no indication as to the types of phrases we expected, but were rather shown a collection of objects and were asked to replicate it, given the four basic types of actions.

Several aspects of the parser have been tested at this stage and are summarised below.

4.1 Accuracy

Table 1 shows three hand-tuned configurations, Ω1 to Ω3, of the module weights ωi, with ω2 = 0.0, since κ2 provides a ‘hard constraint’ (§3.2).

Figure 7 shows sentence-level precision achieved for each Ωi on each of the four tasks, where precision is defined as the proportion of correctly executed sentences. These are compared against the CMU Sphinx-4 speech recogniser using the unimodal projection of the multimodal grammar. Here, conjunctive phrases such as “Put a sphere here and colour it yellow” are classified according to their first clause.

Presently, correlating the coverage and probabilistic grammar constraints with higher weights (> 30%) appears to provide the best results. Creation and colouring tasks appeared to suffer most due to missing or misunderstood head-noun modifiers (i.e., object colour). In these examples, CLAVIUS ranged from a −51.7% to a 62.5% relative error reduction rate over all tasks.
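For readers unfamiliar with the measure, relative error reduction is conventionally the fraction of the baseline's errors that the new system removes. The sketch below assumes that conventional definition, with error taken as one minus precision; the paper does not state its exact formula, and the numbers in the usage lines are invented for illustration and are not results from this study.

```python
def relative_error_reduction(baseline_precision, system_precision):
    """Fraction of the baseline's errors removed (negative if errors increase)."""
    baseline_error = 1.0 - baseline_precision
    system_error = 1.0 - system_precision
    return (baseline_error - system_error) / baseline_error

# Hypothetical precisions, not measurements from the experiment.
improved = relative_error_reduction(0.50, 0.75)   # half of the errors removed
worsened = relative_error_reduction(0.80, 0.70)   # more errors than the baseline
```

A negative value therefore means the system made more errors than the baseline on that task.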

Table 1: Three weight configurations

Figure 7: Precision across the test tasks

4.2 Work Expenditure

To test whether the best-first approach compensates for CLAVIUS’ looser constraints (§1.2), a simple bottom-up multichart parser (§1.1) was constructed and the average number of edges it produces on sentences of varying length was measured. Figure 8 compares this against the average number of edges produced by CLAVIUS on the same data. In particular, although CLAVIUS generally finds the parse it will accept relatively quickly (‘CLAVIUS - found’), the COGNITION module will delay its acceptance (‘CLAVIUS - accepted’) for a time. Further tuning will hopefully reduce this ‘waiting period’.

Figure 8: Number of edges expanded, given sentence length

5 Remarks

CLAVIUS consistently ignores over 92% of dysfluencies (e.g., “uh”) and significant noise events in tracking, apparently as a result of the partial qualifications discussed in §1.2.3, which is especially relevant in noisy environments. Early unquantified observation also suggests that a result of unordered constituents is that parses incorporating lead words (head nouns, command verbs and pointing gestures in particular) are emphasised and form sentence-level parses early, and are later ‘filled in’ with function words.

5.1 Ongoing Work

There are at least four avenues open to exploration in the near future. First, applying the parser to directed two-party dialogue will explore context-sensitivity and a more complex grammar. Second, the architecture lends itself to further parallelism, specifically by permitting P > 1 concurrent processing units to dynamically decide whether to employ the GENERALISER or SPECIFIER, based on the sizes of shared active subspaces.

We are also currently working on scoring modules that incorporate language modelling (with discriminative training), and prosody-based co-analysis. Finally, we have already begun work on automatic methods to train scoring parameters, including the distribution of ωi, and module-specific training.

6 Acknowledgements

Funding has been provided by la bourse de maîtrise of the Fonds québécois de la recherche sur la nature et les technologies.

References

Ageno, A., Rodriguez, H. 2000. Extending Bidirectional Chart Parsing with a Stochastic Model, in Proc. of TSD 2000, Brno, Czech Republic.

Alexandersson, J., Becker, T. 2001. Overlay as the Basic Operation for Discourse Processing in a Multimodal Dialogue System, in Proc. of the 2nd IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, Seattle, WA.

Bolt, R.A. 1980. “Put-that-there”: Voice and gesture at the graphics interface, in Proc. of SIGGRAPH 80, ACM Press, New York, NY.

Boussemart, Y., Rioux, F., Rudzicz, F., Wozniewski, M., Cooperstock, J. 2004. A Framework for 3D Visualisation and Manipulation in an Immersive Space using an Untethered Bimanual Gestural Interface, in Proc. of VRST 2004, ACM Press, Hong Kong.

Dowding, J. et al. 1993. Gemini: A Natural Language System For Spoken-Language Understanding, in Meeting of the ACL, ACL, Morristown, NJ.

Holzapfel, H., Nickel, K., Stiefelhagen, R. 2004. Implementation and evaluation of a constraint-based multimodal fusion system for speech and 3D pointing gestures, in ICMI ’04: Proc. of the 6th intl. conference on Multimodal interfaces, ACM Press, New York, NY.

Johnston, M. 1998. Unification-based multimodal parsing, in Proc. of the 36th annual meeting of the ACL, ACL, Morristown, NJ.

Johnston, M., Bangalore, S. 2000. Finite-state multimodal parsing and understanding, in Proc. of the 18th conference on Computational linguistics, ACL, Morristown, NJ.

Kettebekov, S., et al. 2002. Prosody Based Co-analysis of Deictic Gestures and Speech in Weather Narration Broadcast, in Workshop on Multimodal Resources and Multimodal System Evaluation (LREC 2002), Las Palmas, Spain.

McNeill, D. 1992. Hand and mind: What gestures reveal about thought. University of Chicago Press and CSLI Publications, Chicago, IL.

Rocio, V., Lopes, J.G. 1998. Partial Parsing, Deduction and Tabling, in TAPD 98.

Tomita, M. 1985. An Efficient Context-Free Parsing Algorithm for Natural Languages, in Proc. Ninth Intl. Joint Conf. on Artificial Intelligence, Los Angeles, CA.
