Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition
Daniel Jurafsky & James H. Martin
Copyright © 2006, All rights reserved. Draft of June 25, 2007. Do not cite without permission.

1 Introduction
Dave Bowman: Open the pod bay doors, HAL.
HAL: I'm sorry Dave, I'm afraid I can't do that.
Stanley Kubrick and Arthur C. Clarke, screenplay of 2001: A Space Odyssey
This book is about a new interdisciplinary field variously called computer speech and language processing or human language technology or natural language processing or computational linguistics. The goal of this new field is to get computers to perform useful tasks involving human language, tasks like enabling human-machine communication, improving human-human communication, or simply doing useful processing of text or speech.
One example of such a useful task is a conversational agent. The HAL 9000 computer in Stanley Kubrick's film 2001: A Space Odyssey is one of the most recognizable characters in twentieth-century cinema. HAL is an artificial agent capable of such advanced language-processing behavior as speaking and understanding English, and at a crucial moment in the plot, even reading lips. It is now clear that HAL's creator, Arthur C. Clarke, was a little optimistic in predicting when an artificial agent such as HAL would be available. But just how far off was he? What would it take to create at least the language-related parts of HAL? We call programs like HAL that converse with humans via natural language conversational agents or dialogue systems. In this text we study the various components that make up modern conversational agents, including language input (automatic speech recognition and natural language understanding) and language output (natural language generation and speech synthesis).
Let's turn to another useful language-related task, that of making available to non-English-speaking readers the vast amount of scientific information on the Web in English, or translating for English speakers the hundreds of millions of Web pages written in other languages like Chinese. The goal of machine translation is to automatically translate a document from one language to another. Machine translation is far from a solved problem; we will cover the algorithms currently used in the field, as well as important component tasks.
Many other language processing tasks are also related to the Web. Another such task is Web-based question answering. This is a generalization of simple Web search, where instead of just typing keywords a user might ask complete questions, ranging from easy to hard, like the following:
• What does “divergent” mean?
• What year was Abraham Lincoln born?
• How many states were in the United States that year?
• How much Chinese silk was exported to England by the end of the 18th century?
• What do scientists think about the ethics of human cloning?
Some of these, such as definition questions, or simple factoid questions like dates and locations, can already be answered by search engines. But answering more complicated questions might require extracting information that is embedded in other text on a Web page, or doing inference (drawing conclusions based on known facts), or synthesizing and summarizing information from multiple sources or Web pages. In this text we study the various components that make up modern understanding systems of this kind, including information extraction, word sense disambiguation, and so on.
Although the subfields and problems we've described above are all very far from completely solved, these are all very active research areas and many technologies are already available commercially. In the rest of this chapter we briefly summarize the kinds of knowledge that are necessary for these tasks (and others like spelling correction, grammar checking, and so on), as well as the mathematical models that will be introduced throughout the book.
1.1 Knowledge in Speech and Language Processing

What distinguishes language processing applications from other data processing systems is their use of knowledge of language. Consider the Unix wc program, which is used to count the total number of bytes, words, and lines in a text file. When used to count bytes and lines, wc is an ordinary data processing application. However, when it is used to count the words in a file, it requires knowledge about what it means to be a word, and thus becomes a language processing system.
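The point is easy to see in code. The following is a minimal sketch in Python (not the actual wc implementation): two equally plausible definitions of "word" give two different counts for the same string.

```python
# A minimal sketch: even "counting words" forces a decision about what a word is.
import re

text = "HAL, I'm afraid I can't do that -- open the pod-bay doors!"

# Split on whitespace, as wc does: punctuation sticks to the words.
whitespace_tokens = text.split()

# Keep only runs of word characters: contractions like "can't" break apart.
regex_tokens = re.findall(r"\w+", text)

print(len(whitespace_tokens), whitespace_tokens)  # 12 tokens
print(len(regex_tokens), regex_tokens)            # 14 tokens
```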
Of course, wc is an extremely simple system with an extremely limited and impoverished knowledge of language. Sophisticated conversational agents like HAL, or machine translation systems, or robust question-answering systems, require much broader and deeper knowledge of language. To get a feeling for the scope and kind of required knowledge, consider some of what HAL would need to know to engage in the dialogue that begins this chapter, or for a question answering system to answer one of the questions above.
HAL must be able to recognize words from an audio signal and to generate an audio signal from a sequence of words. These tasks of speech recognition and speech synthesis require knowledge about phonetics and phonology: how words are pronounced in terms of sequences of sounds, and how each of these sounds is realized acoustically.
Note also that unlike Star Trek's Commander Data, HAL is capable of producing contractions like I'm and can't. Producing and recognizing these and other variations of individual words (e.g., recognizing that doors is plural) requires knowledge about morphology, the way words break down into component parts that carry meanings like singular versus plural.
Moving beyond individual words, HAL must use structural knowledge to properly string together the words that constitute its response. For example, HAL must know that the following sequence of words will not make sense to Dave, despite the fact that it contains precisely the same set of words as the original:

I'm I do, sorry that afraid Dave I'm can't

The knowledge needed to order and group words together comes under the heading of syntax.
Now consider a question answering system dealing with the following question:

• How much Chinese silk was exported to Western Europe by the end of the 18th century?

In order to answer this question we need to know something about lexical semantics, the meaning of all the words (export or silk), as well as compositional semantics (what exactly constitutes Western Europe as opposed to Eastern or Southern Europe; what does end mean when combined with the 18th century?). We also need to know something about the relationship of the words to the syntactic structure. For example, we need to know that by the end of the 18th century is a temporal end-point, and not a description of the agent, as the by-phrase is in the following sentence:

• How much Chinese silk was exported to Western Europe by southern merchants?
We also need the kind of knowledge that lets HAL determine that Dave's utterance is a request for action, as opposed to a simple statement about the world or a question about the door, as in the following variations of his original statement:

STATEMENT: HAL, the pod bay door is open.
INFORMATION QUESTION: HAL, is the pod bay door open?

Next, despite its bad behavior, HAL knows enough to be polite to Dave. It could, for example, have simply replied No or No, I won't open the door. Instead, it first embellishes its response with the phrases I'm sorry and I'm afraid, and then only indirectly signals its refusal by saying I can't, rather than the more direct (and truthful) I won't.1 This knowledge about the kind of actions that speakers intend by their use of sentences is pragmatic or dialogue knowledge.
Another kind of pragmatic or discourse knowledge is required to answer the question

• How many states were in the United States that year?

What year is that year? In order to interpret words like that year, a question answering system needs to examine the earlier questions that were asked; in this case the previous question talked about the year that Lincoln was born. Thus this task of coreference resolution makes use of knowledge about how words like that or pronouns like it or she refer to previous parts of the discourse.
To summarize, engaging in complex language behavior requires various kinds of knowledge of language:
1 For those unfamiliar with HAL, it is neither sorry nor afraid, nor is it incapable of opening the door. It has simply decided in a fit of paranoia to kill its crew.
• Phonetics and Phonology — knowledge about linguistic sounds
• Morphology — knowledge of the meaningful components of words
• Syntax — knowledge of the structural relationships between words
• Semantics — knowledge of meaning
• Pragmatics — knowledge of the relationship of meaning to the goals and intentions of the speaker
• Discourse — knowledge about linguistic units larger than a single utterance
1.2 Ambiguity

A perhaps surprising fact about these categories of linguistic knowledge is that most tasks in speech and language processing can be viewed as resolving ambiguity at one of these levels. We say some input is ambiguous if there are multiple alternative linguistic structures that can be built for it. Consider the spoken sentence I made her duck. Here are five different meanings this sentence could have (see if you can think of some more), each of which exemplifies an ambiguity at some level:
(1.1) I cooked waterfowl for her
(1.2) I cooked waterfowl belonging to her
(1.3) I created the (plaster?) duck she owns
(1.4) I caused her to quickly lower her head or body
(1.5) I waved my magic wand and turned her into undifferentiated waterfowl
These different meanings are caused by a number of ambiguities. First, the words duck and her are morphologically or syntactically ambiguous in their part-of-speech: duck can be a verb or a noun, while her can be a dative pronoun or a possessive pronoun. Second, the word make is semantically ambiguous; it can mean create or cook. Third, the verb make is syntactically ambiguous in a different way: make can be transitive, that is, taking a single direct object (1.2), or it can be ditransitive, that is, taking two objects (1.5), meaning that the first object (her) got made into the second object (duck). Finally, make can take a direct object and a verb (1.4), meaning that the object (her) got caused to perform the verbal action (duck). Furthermore, in a spoken sentence there is an even deeper kind of ambiguity; the first word could have been eye or the second word maid.
We will often introduce the models and algorithms we present throughout the book as ways to resolve or disambiguate these ambiguities. For example, deciding whether duck is a verb or a noun can be solved by part-of-speech tagging. Deciding whether make means "create" or "cook" can be solved by word sense disambiguation. Resolution of part-of-speech and word sense ambiguities are two important kinds of lexical disambiguation. A wide variety of tasks can be framed as lexical disambiguation problems. For example, a text-to-speech synthesis system reading the word lead needs to decide whether it should be pronounced as in lead pipe or as in lead me on. By contrast, deciding whether her and duck are part of the same entity (as in (1.1) or (1.4)) or are different entities (as in (1.2)) is an example of syntactic disambiguation and can
be addressed by probabilistic parsing. Ambiguities that don't arise in this particular example (like whether a given sentence is a statement or a question) will also be resolved, for example by speech act interpretation.
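As a small, hedged illustration of lexical disambiguation in practice, the sketch below runs an off-the-shelf part-of-speech tagger from the NLTK toolkit (mentioned at the end of this chapter) on the ambiguous sentence. The exact tags printed depend on the particular tagger model installed; the point is only that a tagger must commit to a single reading.

```python
# Assumes NLTK and its default models are installed:
#   pip install nltk
#   nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
import nltk

tokens = nltk.word_tokenize("I made her duck")
print(nltk.pos_tag(tokens))
# A statistical tagger must choose one part-of-speech per word; for example,
# it may tag "duck" as a noun even though the verb reading is also possible.
```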
1.3 Models and Algorithms

One of the key insights of the last 50 years of research in language processing is that the various kinds of knowledge described in the last sections can be captured through the use of a small number of formal models, or theories. Fortunately, these models and theories are all drawn from the standard toolkits of computer science, mathematics, and linguistics and should be generally familiar to those trained in those fields. Among the most important models are state machines, rule systems, logic, probabilistic models, and vector-space models. These models, in turn, lend themselves to a small number of algorithms, among the most important of which are state space search algorithms such as dynamic programming, and machine learning algorithms such as classifiers and EM and other learning algorithms.
In their simplest formulation, state machines are formal models that consist of states, transitions among states, and an input representation. Some of the variations of this basic model that we will consider are deterministic and non-deterministic finite-state automata and finite-state transducers.
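To make the idea concrete, here is a minimal sketch of a deterministic finite-state automaton written as a transition table. The language it accepts (baa!, baaa!, baaaa!, ...) is just an illustrative toy, not anything defined in this chapter.

```python
# A tiny deterministic finite-state automaton as a transition table.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # self-loop: any number of additional a's
    (3, "!"): 4,
}
ACCEPTING = {4}

def accepts(string: str) -> bool:
    state = 0
    for ch in string:
        state = TRANSITIONS.get((state, ch))
        if state is None:          # no transition defined: reject
            return False
    return state in ACCEPTING

print(accepts("baaaa!"))  # True
print(accepts("ba!"))     # False
```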
Closely related to these models are their declarative counterparts: formal rule systems. Among the more important ones we will consider are regular grammars and regular relations, context-free grammars, and feature-augmented grammars, as well as probabilistic variants of them all. State machines and formal rule systems are the main tools used when dealing with knowledge of phonology, morphology, and syntax.

The third model that plays a critical role in capturing knowledge of language is logic. We will discuss first-order logic, also known as the predicate calculus, as well as such related formalisms as the lambda-calculus, feature structures, and semantic primitives. These logical representations have traditionally been used for modeling semantics and pragmatics, although more recent work has focused on more robust techniques drawn from non-logical lexical semantics.
Probabilistic models are crucial for capturing every kind of linguistic knowledge. Each of the other models (state machines, formal rule systems, and logic) can be augmented with probabilities. For example, the state machine can be augmented with probabilities to become the weighted automaton or Markov model. We will spend a significant amount of time on hidden Markov models or HMMs, which are used everywhere in the field: in part-of-speech tagging, speech recognition, dialogue understanding, text-to-speech, and machine translation. The key advantage of probabilistic models is their ability to solve the many kinds of ambiguity problems that we discussed earlier; almost any speech and language processing problem can be recast as: "given N choices for some ambiguous input, choose the most probable one".
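A sketch of that recipe, with probabilities invented purely for illustration (no real model is being consulted), looks like this:

```python
# "Given N choices for some ambiguous input, choose the most probable one."
# The probabilities below are made up for the example, not estimated from data.
candidate_senses = {
    "create": 0.15,   # "I made her duck" = I created the duck she owns
    "cook":   0.60,   # = I cooked waterfowl for her
    "cause":  0.25,   # = I caused her to quickly lower her head
}

best_sense = max(candidate_senses, key=candidate_senses.get)
print(best_sense)  # cook
```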
Finally, vector-space models, based on linear algebra, underlie information retrieval and many treatments of word meanings.
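As a bare-bones illustration of the vector-space idea, the sketch below represents a query and two documents as term-count vectors (the counts are invented for the example) and compares them with cosine similarity:

```python
import math

def cosine(u, v):
    # Cosine of the angle between two count vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# term order: [silk, export, china, duck]
query = [1, 1, 1, 0]
doc_a = [3, 2, 4, 0]   # a document about the silk trade
doc_b = [0, 0, 0, 5]   # a document about waterfowl

print(cosine(query, doc_a))  # close to 1.0
print(cosine(query, doc_b))  # 0.0
```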
Processing language using any of these models typically involves a search through
a space of states representing hypotheses about an input. In speech recognition, we search through a space of phone sequences for the correct word. In parsing, we search through a space of trees for the syntactic parse of an input sentence. In machine translation, we search through a space of translation hypotheses for the correct translation of a sentence into another language. For non-probabilistic tasks, such as state machines, we use well-known graph algorithms such as depth-first search. For probabilistic tasks, we use heuristic variants such as best-first and A* search, and rely on dynamic programming algorithms for computational tractability.
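One concrete instance of the kind of dynamic programming referred to here is minimum edit distance, which the book develops in a later chapter. A compact sketch with unit insertion, deletion, and substitution costs:

```python
def min_edit_distance(source: str, target: str) -> int:
    # Fill in a table of subproblem solutions instead of searching all alignments.
    n, m = len(source), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,              # deletion
                             dist[i][j - 1] + 1,              # insertion
                             dist[i - 1][j - 1] + substitution)
    return dist[n][m]

print(min_edit_distance("intention", "execution"))  # 5 with unit costs
```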
For many language tasks, we rely on machine learning tools like classifiers and sequence models. Classifiers like decision trees, support vector machines, Gaussian mixture models, and logistic regression are very commonly used. A hidden Markov model is one kind of sequence model; others are Maximum Entropy Markov Models and Conditional Random Fields.
Another tool that is related to machine learning is methodological: the use of distinct training and test sets, statistical techniques like cross-validation, and careful evaluation of our trained systems.
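The sketch below illustrates this methodological point with scikit-learn, chosen here only as a convenient example toolkit; the data are synthetic, generated on the spot rather than drawn from any corpus.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# Synthetic feature vectors and labels, invented for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out a test set; train only on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

classifier = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", classifier.score(X_test, y_test))
print("5-fold cross-validation:", cross_val_score(classifier, X_train, y_train, cv=5))
```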
1.4 Language, Thought, and Understanding

To many, the ability of computers to process language as skillfully as we humans do will signal the arrival of truly intelligent machines. The basis of this belief is the fact that the effective use of language is intertwined with our general cognitive abilities. Among the first to consider the computational implications of this intimate connection was Alan Turing (1950). In this famous paper, Turing introduced what has come to be known as the Turing test. Turing began with the thesis that the question of what it would mean for a machine to think was essentially unanswerable due to the inherent imprecision in the terms machine and think. Instead, he suggested an empirical test, a game, in which a computer's use of language would form the basis for determining if it could think. If the machine could win the game it would be judged intelligent.
In Turing's game, there are three participants: two people and a computer. One of the people is a contestant and plays the role of an interrogator. To win, the interrogator must determine which of the other two participants is the machine by asking a series of questions via a teletype. The task of the machine is to fool the interrogator into believing it is a person by responding as a person would to the interrogator's questions. The task of the second human participant is to convince the interrogator that the other participant is the machine, and that they are human.
The following interaction from Turing's paper illustrates the kind of interactions he had in mind. Clearly a convincing simulation of humanity doesn't require being an expert in all domains:

Q: Please write me a sonnet on the topic of the Forth Bridge.
A: Count me out on this one. I never could write poetry.
Q: Add 34957 to 70764.
A: (Pause about 30 seconds and then give answer as) 105621.
Turing predicted that roughly by the end of the twentieth century a machine with 10
gigabytes of memory would have around a 30% chance of fooling a human interrogator after 5 minutes of questioning. Given the fact that you can fool some of the people all the time, it is not clear how rigorous this particular standard is. Nevertheless, the critical issue for Turing was that using language as humans do is sufficient, by itself, as an operational test for intelligence.
Surprisingly, many of the issues raised by Turing became relevant as early as 1966 in the context of the ELIZA program (Weizenbaum, 1966). ELIZA was an early natural language processing system capable of carrying on a limited form of conversation with a user. Consider the following session with a version of ELIZA that imitated the responses of a Rogerian psychotherapist:

User1: You are like my father in some ways.
ELIZA1: WHAT RESEMBLANCE DO YOU SEE
User2: You are not very aggressive but I think you don't want me to notice that.
ELIZA2: WHAT MAKES YOU THINK I AM NOT AGGRESSIVE
User3: You don't argue with me.
ELIZA3: WHY DO YOU THINK I DON'T ARGUE WITH YOU
User4: You are afraid of me.
ELIZA4: DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU
As we will discuss in more detail in Ch. 2, ELIZA is a remarkably simple program that makes use of pattern matching to process the input and translate it into suitable outputs. The success of this simple technique in this domain is due to the fact that ELIZA doesn't actually need to know anything to mimic a Rogerian psychotherapist. As Weizenbaum notes, this is one of the few dialogue genres where the listener can act as if they know nothing of the world.
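A toy version of that mechanism fits in a few lines. The rules below are invented for illustration and are far simpler than Weizenbaum's actual rule set, but they show how a ranked list of pattern-to-template rules can produce ELIZA-like responses.

```python
# A toy ELIZA-style responder; the real ELIZA (Weizenbaum, 1966) used a much
# larger set of ranked patterns and transformations.
import re

RULES = [
    (re.compile(r".*\bYou are (.*)", re.IGNORECASE),
     "WHAT MAKES YOU THINK I AM {0}"),
    (re.compile(r".*\bI am (.*)", re.IGNORECASE),
     "WHY DO YOU SAY YOU ARE {0}"),
]

def respond(utterance: str) -> str:
    for pattern, template in RULES:
        match = pattern.match(utterance)
        if match:
            return template.format(match.group(1).rstrip(".!").upper())
    return "PLEASE GO ON"   # default when no pattern matches

print(respond("You are not very aggressive."))
# WHAT MAKES YOU THINK I AM NOT VERY AGGRESSIVE
```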
ELIZA's deep relevance to Turing's ideas is that many people who interacted with ELIZA came to believe that it really understood them and their problems. Indeed, Weizenbaum (1976) notes that many of these people continued to believe in ELIZA's abilities even after the program's operation was explained to them. In more recent years, Weizenbaum's informal reports have been repeated in a somewhat more controlled setting. Since 1991, an event known as the Loebner Prize competition has attempted to put various computer programs to the Turing test. Although these contests seem to have little scientific interest, a consistent result over the years has been that even the crudest programs can fool some of the judges some of the time (Shieber, 1994). Not surprisingly, these results have done nothing to quell the ongoing debate over the suitability of the Turing test as a test for intelligence among philosophers and AI researchers (Searle, 1980).
Fortunately, for the purposes of this book, the relevance of these results does not hinge on whether or not computers will ever be intelligent, or understand natural language. Far more important is recent related research in the social sciences that has confirmed another of Turing's predictions from the same paper:

Nevertheless I believe that at the end of the century the use of words and educated opinion will have altered so much that we will be able to speak of machines thinking without expecting to be contradicted.
It is now clear that regardless of what people believe or know about the inner workings of computers, they talk about them and interact with them as social entities. People act
toward computers as if they were people; they are polite to them, treat them as team members, and expect among other things that computers should be able to understand their needs and be capable of interacting with them naturally. For example, Reeves and Nass (1996) found that when a computer asked a human to evaluate how well the computer had been doing, the human gives more positive responses than when a different computer asks the same questions. People seemed to be afraid of being impolite. In a different experiment, Reeves and Nass found that people also give computers higher performance ratings if the computer has recently said something flattering to the human. Given these predispositions, speech- and language-based systems may provide many users with the most natural interface for many applications. This fact has led to a long-term focus in the field on the design of conversational agents, artificial entities that communicate conversationally.
1.5 The State of the Art

We can only see a short distance ahead, but we can see plenty there that needs to be done.
Alan Turing

This is an exciting time for the field of speech and language processing. The startling increase in computing resources available to the average computer user, the rise of the Web as a massive source of information, and the increasing availability of wireless mobile access have all placed speech and language processing applications in the technology spotlight. The following are examples of some currently deployed systems that reflect this trend:
• Travelers calling Amtrak, United Airlines, and other travel providers interact with conversational agents that guide them through the process of making reservations and getting arrival and departure information.
• Luxury car makers such as Mercedes-Benz provide automatic speech recognition and text-to-speech systems that allow drivers to control their environmental, entertainment, and navigational systems by voice. A similar spoken dialogue system has been deployed by astronauts on the International Space Station.
• Blinkx and other video search companies provide search services for millions of hours of video on the Web by using speech recognition technology to capture the words in the sound track.
• Google provides cross-language information retrieval and translation services whereby a user can supply queries in their native language to search collections in another language. Google translates the query, finds the most relevant pages, and then automatically translates them back to the user's native language.
• Large educational publishers such as Pearson, as well as testing services like ETS, use automated systems to analyze thousands of student essays, grading and assessing them in a manner that is indistinguishable from human graders.
• Interactive tutors, based on lifelike animated characters, serve as tutors for children learning to read, and as therapists for people dealing with aphasia and Parkinson's disease (?, ?).
• Text analysis companies such as Nielsen Buzzmetrics, Umbria, and Collective Intellect provide marketing intelligence based on automated measurements of user opinions, preferences, and attitudes as expressed in weblogs, discussion forums, and user groups.
1.6 Some Brief History

Historically, speech and language processing has been treated very differently in computer science, electrical engineering, linguistics, and psychology/cognitive science. Because of this diversity, speech and language processing encompasses a number of different but overlapping fields in these different departments: computational linguistics in linguistics, natural language processing in computer science, speech recognition in electrical engineering, and computational psycholinguistics in psychology. This section summarizes the different historical threads which have given rise to the field of speech and language processing. This section will provide only a sketch; see the individual chapters for more detail on each area and its terminology.
1.6.1 Foundational Insights: 1940s and 1950s
The earliest roots of the field date to the intellectually fertile period just after World War II that gave rise to the computer itself. This period from the 1940s through the end of the 1950s saw intense work on two foundational paradigms: the automaton and probabilistic or information-theoretic models.
The automaton arose in the 1950s out of Turing's (1936) model of algorithmic computation, considered by many to be the foundation of modern computer science. Turing's work led first to the McCulloch-Pitts neuron (McCulloch and Pitts, 1943), a simplified model of the neuron as a kind of computing element that could be described in terms of propositional logic, and then to the work of Kleene (1951) and (1956) on finite automata and regular expressions. Shannon (1948) applied probabilistic models of discrete Markov processes to automata for language. Drawing the idea of a finite-state Markov process from Shannon's work, Chomsky (1956) first considered finite-state machines as a way to characterize a grammar, and defined a finite-state language as a language generated by a finite-state grammar. These early models led to the field of formal language theory, which used algebra and set theory to define formal languages as sequences of symbols. This includes the context-free grammar, first defined by Chomsky (1956) for natural languages but independently discovered by Backus (1959) and Naur et al. (1960) in their descriptions of the ALGOL programming language.

The second foundational insight of this period was the development of probabilistic algorithms for speech and language processing, which dates to Shannon's other contribution: the metaphor of the noisy channel and decoding for the transmission of language through media like communication channels and speech acoustics. Shannon
also borrowed the concept of entropy from thermodynamics as a way of measuring
the information capacity of a channel, or the information content of a language, and performed the first measure of the entropy of English using probabilistic techniques.
It was also during this early period that the sound spectrograph was developed (Koenig et al., 1946), and foundational research was done in instrumental phonetics that laid the groundwork for later work in speech recognition. This led to the first machine speech recognizers in the early 1950s. In 1952, researchers at Bell Labs built a statistical system that could recognize any of the 10 digits from a single speaker (Davis et al., 1952). The system had 10 speaker-dependent stored patterns roughly representing the first two vowel formants in the digits. They achieved 97–99% accuracy by choosing the pattern which had the highest relative correlation coefficient with the input.
1.6.2 The Two Camps: 1957–1970
By the end of the 1950s and the early 1960s, speech and language processing had split very cleanly into two paradigms: symbolic and stochastic.
The symbolic paradigm took off from two lines of research. The first was the work of Chomsky and others on formal language theory and generative syntax throughout the late 1950s and early to mid 1960s, and the work of many linguists and computer scientists on parsing algorithms, initially top-down and bottom-up and then via dynamic programming. One of the earliest complete parsing systems was Zelig Harris's Transformations and Discourse Analysis Project (TDAP), which was implemented between June 1958 and July 1959 at the University of Pennsylvania (Harris, 1962).2 The second line of research was the new field of artificial intelligence. In the summer of 1956 John McCarthy, Marvin Minsky, Claude Shannon, and Nathaniel Rochester brought together a group of researchers for a two-month workshop on what they decided to call artificial intelligence (AI). Although AI always included a minority of researchers focusing on stochastic and statistical algorithms (including probabilistic models and neural nets), the major focus of the new field was the work on reasoning and logic typified by Newell and Simon's work on the Logic Theorist and the General Problem Solver. At this point early natural language understanding systems were built. These were simple systems that worked in single domains mainly by a combination of pattern matching and keyword search with simple heuristics for reasoning and question-answering. By the late 1960s more formal logical systems were developed.
The stochastic paradigm took hold mainly in departments of statistics and of electrical engineering. By the late 1950s the Bayesian method was beginning to be applied to the problem of optical character recognition. Bledsoe and Browning (1959) built a Bayesian system for text recognition that used a large dictionary and computed the likelihood of each observed letter sequence given each word in the dictionary by multiplying the likelihoods for each letter. Mosteller and Wallace (1964) applied Bayesian methods to the problem of authorship attribution on The Federalist papers.
The 1960s also saw the rise of the first serious testable psychological models of
2 This system was reimplemented recently and is described by Joshi and Hopely (1999) and Karttunen (1999), who note that the parser was essentially implemented as a cascade of finite-state transducers.
human language processing based on transformational grammar, as well as the first on-line corpora: the Brown corpus of American English, a one-million-word collection of samples from 500 written texts from different genres (newspaper, novels, non-fiction, academic, etc.), which was assembled at Brown University in 1963–64 (Kučera and Francis, 1967; Francis, 1979; Francis and Kučera, 1982), and William S. Y. Wang's 1967 DOC (Dictionary on Computer), an on-line Chinese dialect dictionary.
1.6.3 Four Paradigms: 1970–1983
The next period saw an explosion in research in speech and language processing and the development of a number of research paradigms that still dominate the field.
The stochastic paradigm played a huge role in the development of speech recognition algorithms in this period, particularly the use of the hidden Markov model and the metaphors of the noisy channel and decoding, developed independently by Jelinek, Bahl, Mercer, and colleagues at IBM's Thomas J. Watson Research Center, and by Baker at Carnegie Mellon University, who was influenced by the work of Baum and colleagues at the Institute for Defense Analyses in Princeton. AT&T's Bell Laboratories was also a center for work on speech recognition and synthesis; see Rabiner and Juang (1993) for descriptions of the wide range of this work.
The logic-based paradigm was begun by the work of Colmerauer and his colleagues on Q-systems and metamorphosis grammars (Colmerauer, 1970, 1975), the forerunners of Prolog, and Definite Clause Grammars (Pereira and Warren, 1980). Independently, Kay's (1979) work on functional grammar and, shortly later, Bresnan and Kaplan's (1982) work on LFG established the importance of feature structure unification.
The natural language understanding field took off during this period, beginning with Terry Winograd's SHRDLU system, which simulated a robot embedded in a world of toy blocks (Winograd, 1972). The program was able to accept natural language text commands (Move the red block on top of the smaller green one) of a hitherto unseen complexity and sophistication. His system was also the first to attempt to build an extensive (for the time) grammar of English, based on Halliday's systemic grammar. Winograd's model made it clear that the problem of parsing was well enough understood to begin to focus on semantics and discourse models. Roger Schank and his colleagues and students (in what was often referred to as the Yale School) built a series of language understanding programs that focused on human conceptual knowledge such as scripts, plans, and goals, and human memory organization (Schank and Abelson, 1977; Schank and Riesbeck, 1981; Cullingford, 1981; Wilensky, 1983; Lehnert, 1977). This work often used network-based semantics (Quillian, 1968; Norman and Rumelhart, 1975; Schank, 1972; Wilks, 1975b, 1975a; Kintsch, 1974) and began to incorporate Fillmore's notion of case roles (Fillmore, 1968) into their representations (Simmons, 1973).
The logic-based and natural-language understanding paradigms were unified in systems that used predicate logic as a semantic representation, such as the LUNAR question-answering system (Woods, 1967, 1973).
The discourse modeling paradigm focused on four key areas in discourse. Grosz and her colleagues introduced the study of substructure in discourse and of discourse
focus (Grosz, 1977; Sidner, 1983), a number of researchers began to work on automatic reference resolution (Hobbs, 1978), and the BDI (Belief-Desire-Intention) framework for logic-based work on speech acts was developed (Perrault and Allen, 1980; Cohen and Perrault, 1979).
1.6.4 Empiricism and Finite State Models Redux: 1983–1993
This next decade saw the return of two classes of models which had lost popularity in the late 1950s and early 1960s, partially due to theoretical arguments against them such as Chomsky's influential review of Skinner's Verbal Behavior (Chomsky, 1959). The first class was finite-state models, which began to receive attention again after work on finite-state phonology and morphology by Kaplan and Kay (1981) and finite-state models of syntax by Church (1980). A large body of work on finite-state models will be described throughout the book.
The second trend in this period was what has been called the "return of empiricism"; most notable here was the rise of probabilistic models throughout speech and language processing, influenced strongly by the work at the IBM Thomas J. Watson Research Center on probabilistic models of speech recognition. These probabilistic methods and other such data-driven approaches spread from speech into part-of-speech tagging, parsing and attachment ambiguities, and semantics. This empirical direction was also accompanied by a new focus on model evaluation, based on using held-out data, developing quantitative metrics for evaluation, and emphasizing the comparison of performance on these metrics with previously published research.

This period also saw considerable work on natural language generation.
1.6.5 The Field Comes Together: 1994–1999
By the last five years of the millennium it was clear that the field was vastly changing. First, probabilistic and data-driven models had become quite standard throughout natural language processing. Algorithms for parsing, part-of-speech tagging, reference resolution, and discourse processing all began to incorporate probabilities and employ evaluation methodologies borrowed from speech recognition and information retrieval. Second, the increases in the speed and memory of computers had allowed commercial exploitation of a number of subareas of speech and language processing, in particular speech recognition and spelling and grammar checking. Speech and language processing algorithms began to be applied to Augmentative and Alternative Communication (AAC). Finally, the rise of the Web emphasized the need for language-based information retrieval and information extraction.
1.6.6 The Rise of Machine Learning: 2000–2007
The empiricist trends begun in the latter part of the 1990s accelerated at an astounding pace in the new century. This acceleration was largely driven by three synergistic trends. First, large amounts of spoken and written material became widely available through the auspices of the Linguistic Data Consortium (LDC) and other similar organizations. Importantly, included among these materials were annotated collections such as the Penn Treebank (Marcus et al., 1993), Prague Dependency Treebank (Hajič, 1998), PropBank (Palmer et al., 2005), Penn Discourse Treebank (Miltsakaki et al., 2004), RSTBank (Carlson et al., 2001), and TimeBank (?), all of which layered standard text sources with various forms of syntactic, semantic, and pragmatic annotations. The existence of these resources promoted the trend of casting more complex traditional problems, such as parsing and semantic analysis, as problems in supervised machine learning. These resources also promoted the establishment of additional competitive evaluations for parsing (Dejean and Tjong Kim Sang, 2001), information extraction (?, ?), word sense disambiguation (Palmer et al., 2001; Kilgarriff and Palmer, 2000), and question answering (Voorhees and Tice, 1999).

Second, this increased focus on learning led to a more serious interplay with the statistical machine learning community. Techniques such as support vector machines (?; Vapnik, 1995), multinomial logistic regression (MaxEnt) (Berger et al., 1996), and graphical Bayesian models (Pearl, 1988) became standard practice in computational linguistics. Third, the widespread availability of high-performance computing systems facilitated the training and deployment of systems that could not have been imagined a decade earlier.

Finally, near the end of this period, largely unsupervised statistical approaches began to receive renewed attention. Progress on statistical approaches to machine translation (Brown et al., 1990; Och and Ney, 2003) and topic modeling (?) demonstrated that effective applications could be constructed from systems trained on unannotated data alone. In addition, the widespread cost and difficulty of producing reliably annotated corpora became a limiting factor in the use of supervised approaches for many problems. This trend towards the use of unsupervised techniques will likely increase.
1.6.7 On Multiple Discoveries
Even in this brief historical overview, we have mentioned a number of cases of multiple independent discoveries of the same idea. Just a few of the "multiples" to be discussed in this book include the application of dynamic programming to sequence comparison by Viterbi, Vintsyuk, Needleman and Wunsch, Sakoe and Chiba, Sankoff, Reichert et al., and Wagner and Fischer (Chapters 3, 5, and 6); the HMM/noisy channel model of speech recognition by Baker and by Jelinek, Bahl, and Mercer (Chapters 6, 9, and 10); the development of context-free grammars by Chomsky and by Backus and Naur (Chapter 12); the proof that Swiss-German has a non-context-free syntax by Huybregts and by Shieber (Chapter 15); and the application of unification to language processing by Colmerauer et al. and by Kay (Chapter 16).
Are these multiples to be considered astonishing coincidences? A well-known hypothesis by sociologist of science Robert K. Merton (1961) argues, quite the contrary, that

all scientific discoveries are in principle multiples, including those that on the surface appear to be singletons.

Of course there are many well-known cases of multiple discovery or invention; just a few examples from an extensive list in Ogburn and Thomas (1922) include the multiple
invention of the calculus by Leibnitz and by Newton, the multiple development of the theory of natural selection by Wallace and by Darwin, and the multiple invention of the telephone by Gray and Bell.3 But Merton gives a further array of evidence for the hypothesis that multiple discovery is the rule rather than the exception, including many cases of putative singletons that turn out to be a rediscovery of previously unpublished or perhaps inaccessible work. An even stronger piece of evidence is his ethnomethodological point that scientists themselves act under the assumption that multiple invention is the norm. Thus many aspects of scientific life are designed to help scientists avoid being "scooped": submission dates on journal articles, careful dates in research records, circulation of preliminary or technical reports.
1.6.8 A Final Brief Note on Psychology
Many of the chapters in this book include short summaries of psychological research on human processing. Of course, understanding human language processing is an important scientific goal in its own right and is part of the general field of cognitive science. However, an understanding of human language processing can often be helpful in building better machine models of language. This seems contrary to the popular wisdom, which holds that direct mimicry of nature's algorithms is rarely useful in engineering applications. For example, the argument is often made that if we copied nature exactly, airplanes would flap their wings; yet airplanes with fixed wings are a more successful engineering solution. But language is not aeronautics. Cribbing from nature is sometimes useful for aeronautics (after all, airplanes do have wings), but it is particularly useful when we are trying to solve human-centered tasks. Airplane flight has different goals than bird flight; but the goal of speech recognition systems, for example, is to perform exactly the task that human court reporters perform every day: transcribe spoken dialog. Since people already do this well, we can learn from nature's previous solution. Since an important application of speech and language processing systems is for human-computer interaction, it makes sense to copy a solution that behaves the way people are accustomed to.
1.7 Summary

This chapter introduces the field of speech and language processing. The following are some of the highlights of this chapter:
• A good way to understand the concerns of speech and language processing research is to consider what it would take to create an intelligent agent like HAL from 2001: A Space Odyssey, or build a Web-based question answerer, or a machine translation engine.
• Speech and language technology relies on formal models, or representations, of knowledge of language at the levels of phonology and phonetics, morphology, syntax, semantics, pragmatics, and discourse. A small number of formal models, including state machines, formal rule systems, logic, and probabilistic models, are used to capture this knowledge.
3 Ogburn and Thomas are generally credited with noticing that the prevalence of multiple inventions suggests that the cultural milieu and not individual genius is the deciding causal factor in scientific discovery. In an amusing bit of recursion, however, Merton notes that even this idea has been multiply discovered, citing sources from the 19th century and earlier!
• The foundations of speech and language technology lie in computer science,
linguistics, mathematics, electrical engineering, and psychology. A small number of algorithms from standard frameworks are used throughout speech and language processing.
• The critical connection between language and thought has placed speech and language processing technology at the center of debate over intelligent machines. Furthermore, research on how people interact with complex media indicates that speech and language processing technology will be critical in the development of future technologies.
• Revolutionary applications of speech and language processing are currently in use around the world. The creation of the Web, as well as significant recent improvements in speech recognition and synthesis, will lead to many more applications.
Bibliographical and Historical Notes

Research in the various subareas of speech and language processing is spread across a wide number of conference proceedings and journals. The conferences and journals most centrally concerned with natural language processing and computational linguistics are associated with the Association for Computational Linguistics (ACL), its European counterpart (EACL), and the International Conference on Computational Linguistics (COLING). The annual proceedings of ACL, NAACL, and EACL, and the biennial COLING conference are the primary forums for work in this area. Related conferences include various proceedings of ACL Special Interest Groups (SIGs) such as the Conference on Natural Language Learning (CoNLL), as well as the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Research on speech recognition, understanding, and synthesis is presented at the annual INTERSPEECH conference, which is called the International Conference on Spoken Language Processing (ICSLP) and the European Conference on Speech Communication and Technology (EUROSPEECH) in alternating years, or the annual IEEE International Conference on Acoustics, Speech, and Signal Processing (IEEE ICASSP). Spoken language dialogue research is presented at these or at workshops like SIGDial. Journals include Computational Linguistics, Natural Language Engineering, Speech Communication, Computer Speech and Language, the IEEE Transactions on Audio, Speech & Language Processing, and the ACM Transactions on Speech and Language Processing.
Work on language processing from an Artificial Intelligence perspective can be found in the annual meetings of the American Association for Artificial Intelligence (AAAI), as well as the biennial International Joint Conference on Artificial Intelligence (IJCAI) meetings. Artificial intelligence journals that periodically feature work on speech and language processing include Machine Learning, the Journal of Machine Learning Research, and the Journal of Artificial Intelligence Research.
There are a fair number of textbooks available covering various aspects of speech and language processing. Manning and Schütze (1999) (Foundations of Statistical Natural Language Processing) focuses on statistical models of tagging, parsing, disambiguation, collocations, and other areas. Charniak (1993) (Statistical Language Learning) is an accessible, though older and less extensive, introduction to similar material. Manning et al. (2008) focuses on information retrieval, text classification, and clustering. NLTK, the Natural Language Toolkit (Bird and Loper, 2004), is a suite of Python modules and data for natural language processing, together with a Natural Language Processing book based on the NLTK suite. Allen (1995) (Natural Language Understanding) provides extensive coverage of language processing from the AI perspective. Gazdar and Mellish (1989) (Natural Language Processing in Lisp/Prolog) covers especially automata, parsing, features, and unification and is available free online. Pereira and Shieber (1987) gives a Prolog-based introduction to parsing and interpretation. Russell and Norvig (2002) is an introduction to artificial intelligence that includes chapters on natural language processing. Partee et al. (1990) has a very broad coverage of mathematical linguistics. A historically significant collection of foundational papers can be found in Grosz et al. (1986) (Readings in Natural Language Processing).
Of course, a wide variety of speech and language processing resources are now available on the Web. Pointers to these resources are maintained on the home page for this book at:
http://www.cs.colorado.edu/~martin/slp.html
Allen, J (1995) Natural Language Understanding Benjamin
Cummings, Menlo Park, CA.
Backus, J. W. (1959). The syntax and semantics of the proposed international algebraic language of the Zurich ACM-GAMM Conference. In Information Processing: Proceedings of the International Conference on Information Processing, Paris, pp. 125–132. UNESCO.
Berger, A., Della Pietra, S A., and Della Pietra, V J (1996) A
maximum entropy approach to natural language processing.
Computational Linguistics, 22(1), 39–71.
Bird, S and Loper, E (2004) NLTK: The Natural Language
Toolkit In Proceedings of the ACL 2004 demonstration
ses-sion, Barcelona, Spain, pp 214–217.
Bledsoe, W W and Browning, I (1959) Pattern recognition
and reading by machine In 1959 Proceedings of the Eastern
Joint Computer Conference, pp 225–232 Academic, New
York.
Bresnan, J and Kaplan, R M (1982) Introduction: Grammars
as mental representations of language In Bresnan, J (Ed.),
The Mental Representation of Grammatical Relations MIT
Press, Cambridge, MA.
Brown, P F., Cocke, J., Della Pietra, S A., Della Pietra, V J.,
Jelinek, F., Lafferty, J D., Mercer, R L., and Roossin, P S.
(1990) A statistical approach to machine translation
Com-putational Linguistics, 16(2), 79–85.
Carlson, L., Marcu, D., and Okurowski, M E (2001)
Build-ing a discourse-tagged corpus in the framework of rhetorical
structure theory In Proceedings of SIGDIAL.
Charniak, E (1993) Statistical Language Learning MIT Press.
Chomsky, N. (1956). Three models for the description of language. IRE Transactions on Information Theory, 2(3), 113–124.
Chomsky, N. (1959). A review of B. F. Skinner's "Verbal Behavior". Language, 35, 26–58.
Church, K. W. (1980). On memory limitations in natural language processing. Master's thesis, MIT. Distributed by the Indiana University Linguistics Club.
Cohen, P R and Perrault, C R (1979) Elements of a
plan-based theory of speech acts Cognitive Science, 3(3), 177–
212.
Colmerauer, A (1970) Les syst`emes-q ou un formalisme pour
analyser et synth´etiser des phrase sur ordinateur Internal
pub-lication 43, D´epartement d’informatique de l’Universit´e de
Montr´eal†.
Colmerauer, A (1975) Les grammaires de m´etamorphose GIA.
Internal publication, Groupe Intelligence artificielle, Facult´e
des Sciences de Luminy, Universit´e Aix-Marseille II, France,
Nov 1975 English version, Metamorphosis grammars In L.
Bolc, (Ed.), Natural Language Communication with
Comput-ers, Lecture Notes in Computer Science 63, Springer Verlag,
Berlin, 1978, pp 133–189.
Cullingford, R E (1981) SAM In Schank, R C and Riesbeck,
C K (Eds.), Inside Computer Understanding: Five Programs
plus Miniatures, pp 75–119 Lawrence Erlbaum, Hillsdale,
NJ.
Davis, K H., Biddulph, R., and Balashek, S (1952) Automatic
recognition of spoken digits Journal of the Acoustical Society
of America, 24(6), 637–642.
Dejean, H. and Tjong Kim Sang, E. F. (2001). Introduction to the CoNLL-2001 shared task: Clause identification. In Proceedings of CoNLL-2001.
Fillmore, C J (1968) The case for case In Bach, E W and
Harms, R T (Eds.), Universals in Linguistic Theory, pp 1–
88 Holt, Rinehart & Winston, New York.
Francis, W N (1979) A tagged corpus – problems and prospects In Greenbaum, S., Leech, G., and Svartvik, J.
(Eds.), Studies in English linguistics for Randolph Quirk, pp.
192–209 Longman, London and New York.
Francis, W N and Kuˇcera, H (1982) Frequency Analysis of
English Usage Houghton Mifflin, Boston.
Gazdar, G. and Mellish, C. (1989). Natural Language Processing in LISP. Addison Wesley.
Grosz, B J (1977) The representation and use of focus in a
system for understanding dialogs In IJCAI-77, Cambridge,
MA, pp 67–76 Morgan Kaufmann Reprinted in Grosz et al (1986).
Grosz, B J., Jones, K S., and Webber, B L (Eds.) (1986).
Readings in Natural Language Processing Morgan
Kauf-mann, Los Altos, Calif.
Hajiˇc, J (1998) Building a Syntactically Annotated Corpus:
The Prague Dependency Treebank, pp 106–132 Karolinum,
Prague/Praha.
Harris, Z S (1962) String Analysis of Sentence Structure.
Mouton, The Hague.
Hobbs, J R (1978) Resolving pronoun references Lingua,
44, 311–338 Reprinted in Grosz et al (1986).
Joshi, A. K. and Hopely, P. (1999). A parser from antiquity. In Kornai, A. (Ed.), Extended Finite State Models of Language, pp. 6–15. Cambridge University Press, Cambridge.
Kaplan, R. M. and Kay, M. (1981). Phonological rules and finite-state transducers. Paper presented at the Annual Meeting of the Linguistics Society of America, New York.
Karttunen, L. (1999). Comments on Joshi. In Kornai, A. (Ed.), Extended Finite State Models of Language, pp. 16–18. Cambridge University Press, Cambridge.
Kay, M (1979) Functional grammar In BLS-79, Berkeley, CA,
pp 142–158.
Kilgarriff, A and Palmer, M (Eds.) (2000) Computing and the
Humanities: Special Issue on SENSEVAL, Vol 34 Kluwer.
Kintsch, W (1974) The Representation of Meaning in Memory.
Wiley, New York.
Kleene, S C (1951) Representation of events in nerve nets and finite automata Tech rep RM-704, RAND Corporation RAND Research Memorandum†.
Kleene, S C (1956) Representation of events in nerve nets and
finite automata In Shannon, C and McCarthy, J (Eds.),
Au-tomata Studies, pp 3–41 Princeton University Press,
Prince-ton, NJ.
Koenig, W., Dunn, H. K., and Lacy, L. Y. (1946). The sound spectrograph. Journal of the Acoustical Society of America, 18, 19–49.
Kučera, H. and Francis, W. N. (1967). Computational Analysis of Present-Day American English. Brown University Press, Providence, RI.
Lehnert, W. G. (1977). A conceptual theory of question answering. In IJCAI-77, Cambridge, MA, pp. 158–164. Morgan Kaufmann.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK.
Manning, C. D. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.
Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19(2), 313–330.
McCulloch, W. S. and Pitts, W. (1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133. Reprinted in Neurocomputing: Foundations of Research, ed. by J. A. Anderson and E. Rosenfeld. MIT Press, 1988.
Merton, R. K. (1961). Singletons and multiples in scientific discovery. American Philosophical Society Proceedings, 105(5), 470–486.
Miltsakaki, E., Prasad, R., Joshi, A K., and Webber, B L.
(2004) The Penn Discourse Treebank In LREC-04.
Mosteller, F and Wallace, D L (1964) Inference and Disputed
Authorship: The Federalist Springer-Verlag, New York 2nd
Edition appeared in 1984 and was called Applied Bayesian
and Classical Inference.
Naur, P., Backus, J. W., Bauer, F. L., Green, J., Katz, C., McCarthy, J., Perlis, A. J., Rutishauser, H., Samelson, K., Vauquois, B., Wegstein, J. H., van Wijngaarden, A., and Woodger, M. (1960). Report on the algorithmic language ALGOL 60. Communications of the ACM, 3(5), 299–314. Revised in CACM 6:1, 1–17, 1963.
Norman, D A and Rumelhart, D E (1975) Explorations in
Cognition Freeman, San Francisco, CA.
Och, F. J. and Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.
Ogburn, W. F. and Thomas, D. S. (1922). Are inventions inevitable? A note on social evolution. Political Science Quarterly, 37, 83–98.
Palmer, M., Fellbaum, C., Cotton, S., Delfs, L., and Dang, H. T. (2001). English tasks: All-words and verb lexical sample. In Proceedings of SENSEVAL-2: Second International Workshop on Evaluating Word Sense Disambiguation Systems, Toulouse, France.
Palmer, M., Kingsbury, P., and Gildea, D. (2005). The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1), 71–106.
Partee, B. H., ter Meulen, A., and Wall, R. E. (1990). Mathematical Methods in Linguistics. Kluwer, Dordrecht.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.
Pereira, F C N and Shieber, S M (1987) Prolog and
Natural-Language Analysis, Vol 10 of CSLI Lecture Notes Chicago
University Press, Chicago.
Pereira, F. C. N. and Warren, D. H. D. (1980). Definite clause grammars for language analysis — a survey of the formalism and a comparison with augmented transition networks. Artificial Intelligence, 13(3), 231–278.
Perrault, C. R. and Allen, J. (1980). A plan-based analysis of indirect speech acts. American Journal of Computational Linguistics, 6(3–4), 167–182.
Quillian, M R (1968) Semantic memory In Minsky, M (Ed.),
Semantic Information Processing, pp 227–270 MIT Press,
Cambridge, MA.
Rabiner, L R and Juang, B (1993) Fundamentals of Speech
Recognition Prentice Hall, Englewood Cliffs, NJ.
Reeves, B and Nass, C (1996) The Media Equation: How
People Treat Computers, Television, and New Media Like Real People and Places Cambridge University Press, Cambridge.
Russell, S. and Norvig, P. (2002). Artificial Intelligence: A Modern Approach. Prentice Hall, Englewood Cliffs, NJ. Second edition.
Schank, R. C. (1972). Conceptual dependency: A theory of natural language processing. Cognitive Psychology, 3, 552–631.
Schank, R. C. and Abelson, R. P. (1977). Scripts, Plans, Goals and Understanding. Lawrence Erlbaum, Hillsdale, NJ.
Schank, R. C. and Riesbeck, C. K. (Eds.). (1981). Inside Computer Understanding: Five Programs plus Miniatures. Lawrence Erlbaum, Hillsdale, NJ.
Searle, J R (1980) Minds, brains, and programs Behavioral
and Brain Sciences, 3, 417–457.
Shannon, C E (1948) A mathematical theory of
communica-tion Bell System Technical Journal, 27(3), 379–423
Contin-ued in following volume.
Shieber, S M (1994) Lessons from a restricted Turing test.
Communications of the ACM, 37(6), 70–78.
Sidner, C L (1983) Focusing in the comprehension of definite
anaphora In Brady, M and Berwick, R C (Eds.),
Compu-tational Models of Discourse, pp 267–330 MIT Press,
Cam-bridge, MA.
Simmons, R F (1973) Semantic networks: Their tion and use for understanding English sentences In Schank,
computa-R C and Colby, K M (Eds.), Computer Models of Thought
and Language, pp 61–113 W.H Freeman and Co., San
Fran-cisco.
Turing, A. M. (1936). On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 42, 230–265. Read to the Society in 1936, but published in 1937. Correction in volume 43, 544–546.
Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59, 433–460.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York.
Voorhees, E. M. and Tice, D. M. (1999). The TREC-8 question answering track evaluation. In Proceedings of the TREC-8 Workshop.
Weizenbaum, J. (1966). ELIZA – A computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1), 36–45.
Weizenbaum, J. (1976). Computer Power and Human Reason: From Judgement to Calculation. W. H. Freeman and Company, San Francisco.
Wilensky, R. (1983). Planning and Understanding. Addison-Wesley, Reading, MA.
Wilks, Y. (1975a). Preference semantics. In Keenan, E. L. (Ed.), The Formal Semantics of Natural Language, pp. 329–350. Cambridge University Press, Cambridge.
Wilks, Y. (1975b). A preferential, pattern-seeking, semantics for natural language inference. Artificial Intelligence, 6(1), 53–74.
Winograd, T. (1972). Understanding natural language. Cognitive Psychology, 3(1), 1–191. Reprinted as a book by Academic Press, 1972.
Woods, W. A. (1967). Semantics for a Question-Answering System. Ph.D. thesis, Harvard University.
Woods, W. A. (1973). Progress in natural language understanding. In Proceedings of AFIPS National Conference, pp. 441–450.
2 Regular Expressions and Automata
User1: Men are all alike.
ELIZA1: IN WHAT WAY
User2: They're always bugging us about something or other.
ELIZA2: CAN YOU THINK OF A SPECIFIC EXAMPLE
User3: Well, my boyfriend made me come here.
ELIZA3: YOUR BOYFRIEND MADE YOU COME HERE
User4: He says I'm depressed much of the time.
ELIZA4: I AM SORRY TO HEAR YOU ARE DEPRESSED.
Suppose a Web search for woodchucks turns up pages like "interesting links to woodchucks and lemurs" and "all about Vermont's unique, endangered species", some using the singular woodchuck and some the plural woodchucks. Instead of having to do this search twice, you would have rather typed one search command specifying something like woodchuck with an optional final s. Or perhaps you might want to search for all the prices in some document; you might want to see all strings that look like $199 or $25 or $24.99. In this chapter we introduce the regular expression, the standard notation for characterizing text sequences. The regular expression is used for specifying text strings in situations like this Web-search example and in other information retrieval applications, but it also plays an important role in word-processing, computation of frequencies from corpora, and other such tasks.
After we have defined regular expressions, we show how they can be implemented via the finite-state automaton. The finite-state automaton is not only the mathematical device used to implement regular expressions, but also one of the most significant tools of computational linguistics. Variations of automata such as finite-state transducers, Hidden Markov Models, and N-gram grammars are important components of applications that we will introduce in later chapters, including speech recognition and synthesis, machine translation, spell-checking, and information extraction.
SIR ANDREW: Her C’s, her U’s and her T’s: why that?
Shakespeare, Twelfth Night
One of the unsung successes in standardization in computer science has been the regular expression (RE), a language for specifying text search strings.
A regular expression (first developed by Kleene (1956), but see the History section for more details) is a formula in a special language that is used for specifying simple classes of strings. A string is a sequence of symbols; for the purpose of most
STRINGS
text-based search techniques, a string is any sequence of alphanumeric characters (letters, numbers, spaces, tabs, and punctuation). For these purposes a space is just a character like any other, and we represent it with the symbol ␣.
Formally, a regular expression is an algebraic notation for characterizing a set of strings. Thus they can be used to specify search strings as well as to define a language in a formal way. We will begin by talking about regular expressions as a way of specifying searches in texts, and proceed to other uses. Section 2.3 shows that the use of just three regular expression operators is sufficient to characterize strings, but we use the more convenient and commonly-used regular expression syntax of the Perl language throughout this section. Since common text-processing programs agree on most of the syntax of regular expressions, most of what we say extends to all UNIX, Microsoft Word, and WordPerfect regular expressions. Appendix A shows the few areas where these programs differ from the Perl syntax.
Regular expression search requires a pattern that we want to search for and a corpus of texts to search through. A regular expression search function will search through the corpus and return all texts that match the pattern; in this chapter we will assume that the search engine returns the matching line of the document. This is what the UNIX grep command does. We will underline the exact part of the pattern that matches the regular expression. A search can be designed to return all matches to a regular expression or only the first match. We will show only the first match.
2.1.1 Basic Regular Expression Patterns
The simplest kind of regular expression is a sequence of simple characters. For example, to search for woodchuck, we type /woodchuck/. So the regular expression /Buttercup/ matches any string containing the substring Buttercup, for example the line I'm called little Buttercup (recall that we are assuming a search application that returns entire lines). From here on we will put slashes around each regular expression to make it clear what is a regular expression and what is a pattern. We use the slash since this is the notation used by Perl, but the slashes are not part of the regular expressions.
The search string can consist of a single character (like /!/) or a sequence of characters (like /urgl/). The first instance of each match to the regular expression is underlined below (although a given application might choose to return more than just the first instance):
/woodchucks/ “interesting links to woodchucks and lemurs”
/Claire says,/ “Dagmar, my gift please,” Claire says,”
/!/ “You’ve left the burglar behind again!” said Nori
Regular expressions are case sensitive; lowercase /s/ is distinct from uppercase /S/ (/s/ matches a lowercase s but not an uppercase S). This means that the pattern /woodchucks/ will not match the string Woodchucks. We can solve this problem with the use of the square braces [ and ]. The string of characters inside the braces specifies a disjunction of characters to match. For example, Fig. 2.1 shows that the pattern /[wW]/ matches patterns containing either w or W.
/[wW]oodchuck/   Woodchuck or woodchuck   "Woodchuck"
Figure 2.1   The use of the brackets [ ] to specify a disjunction of characters.
The regular expression /[1234567890]/ specifies any single digit. While classes of characters like digits or letters are important building blocks in expressions, they can get awkward (e.g., it's inconvenient to specify
/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/
to mean "any capital letter"). In these cases the brackets can be used with the dash (-) to specify any one character in a range. The pattern /[2-5]/ specifies any one of the
RANGE
characters 2, 3, 4, or 5. The pattern /[b-g]/ specifies one of the characters b, c, d, e, f, or g. Some other examples:
/[A-Z]/   an uppercase letter   "we should call it 'Drenched Blossoms'"
/[a-z]/   a lowercase letter    "my beans were impatient to be hoed!"
/[0-9]/   a single digit        "Chapter 1: Down the Rabbit Hole"
Figure 2.2   The use of the brackets [ ] plus the dash - to specify a range.
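Most modern regular expression libraries accept exactly this bracket-and-range notation. As a quick check, here is a minimal Python sketch using the standard re module (an illustration added here, not part of the original text; the test strings are taken from the figures above):

```python
import re

# A disjunction of characters inside square brackets: [wW] matches w or W.
print(re.search(r"[wW]oodchuck", "Woodchuck").group())                       # Woodchuck

# Ranges with the dash: [A-Z] is any uppercase letter, [0-9] any digit.
print(re.search(r"[A-Z]", "we should call it 'Drenched Blossoms'").group())  # D
print(re.findall(r"[0-9]", "Chapter 1: Down the Rabbit Hole"))               # ['1']
```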
The square braces can also be used to specify what a single character cannot be, by use of the caret ˆ. If the caret ˆ is the first symbol after the open square brace [, the resulting pattern is negated. For example, the pattern /[ˆa]/ matches any single character (including special characters) except a. This is only true when the caret is the first symbol after the open square brace. If it occurs anywhere else, it usually stands for a caret; Fig. 2.3 shows some examples.
RE        Match (single characters)    Example Patterns Matched
[ˆA-Z]    not an uppercase letter      "Oyfn pripetchik"
[ˆSs]     neither 'S' nor 's'          "I have no exquisite reason for't"
Figure 2.3   Uses of the caret ˆ for negation or just to mean ˆ.
The use of square braces solves our capitalization problem for woodchucks. But we still haven't answered our original question: how do we specify both woodchuck and woodchucks? We can't use the square brackets, because while they allow us to say "s or S", they don't allow us to say "s or nothing". For this we use the question mark /?/, which means "the preceding character or nothing", as shown in Fig. 2.4.
woodchucks?   woodchuck or woodchucks   "woodchuck"
Figure 2.4   The question mark ? marks optionality of the previous expression.
We can think of the question mark as meaning "zero or one instances of the previous character". That is, it's a way of specifying how many of something we want.
So far we haven't needed to specify that we want more than one of something. But sometimes we need regular expressions that allow repetitions of things. For example, consider the language of (certain) sheep, which consists of strings that look like the following:

baa!
baaa!
baaaa!
baaaaa!
...

This language consists of strings with a b, followed by at least two as, followed by an exclamation point. The set of operators that allows us to say things like "some number of as" is based on the asterisk or *, commonly called the Kleene *.
KLEENE *
The Kleene star means "zero or more occurrences of the immediately previous character or regular expression". So /a*/ matches zero or more as, and /aa*/ matches one or more as: one a followed by zero or more as. More complex patterns can also be repeated. So /[ab]*/ means "zero or more as or bs" (not "zero or more right square braces"). This will match strings like aaaa or ababab or bbbb.
We now know enough to specify part of our regular expression for prices: multiple digits. Recall that the regular expression for an individual digit was /[0-9]/. So the regular expression for an integer (a string of digits) is /[0-9][0-9]*/. (Why isn't it just /[0-9]*/?)
Sometimes it's annoying to have to write the regular expression for digits twice, so there is a shorter way to specify "at least one" of some character. This is the Kleene +,
KLEENE +
which means "one or more of the previous character". Thus the expression /[0-9]+/ is the normal way to specify "a sequence of digits". There are thus two ways to specify the sheep language: /baaa*!/ or /baa+!/.
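To make the two equivalent sheep-language patterns concrete, here is a small Python sketch (an illustration added here, not part of the original text); Python's re module uses the same *, +, and ? operators as the Perl notation in this chapter:

```python
import re

sheep = ["baa!", "baaa!", "ba!", "baaaa"]

for s in sheep:
    # /baaa*!/ and /baa+!/ describe the same language:
    # a b, at least two a's, then an exclamation point.
    m1 = re.fullmatch(r"baaa*!", s)
    m2 = re.fullmatch(r"baa+!", s)
    print(s, bool(m1), bool(m2))
# baa!   True True
# baaa!  True True
# ba!    False False   (only one a)
# baaaa  False False   (no exclamation point)
```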
One very important special character is the period (/./), a wildcard expression
that matches any single character (except a carriage return):
/beg.n/   any character between beg and n   begin, beg'n, begun
Figure 2.5   The use of the period . to specify any character.
The wildcard is often used together with the Kleene star to mean "any string of characters". For example, suppose we want to find any line in which a particular word, for example aardvark, appears twice. We can specify this with the regular expression
/aardvark.*aardvark/
Anchors are special characters that anchor regular expressions to particular places
ANCHORS
in a string. The most common anchors are the caret ˆ and the dollar sign $. The caret ˆ matches the start of a line. The pattern /ˆThe/ matches the word The only at the start of a line. Thus there are three uses of the caret ˆ: to match the start of a line, as a negation inside of square brackets, and just to mean a caret. (What are the contexts that allow Perl to know which function a given caret is supposed to have?) The dollar sign $ matches the end of a line. So the pattern ␣$ is a useful pattern for matching a space at the end of a line, and /ˆThe dog\.$/ matches a line that contains only the phrase The dog. (We have to use the backslash here since we want the . to mean "period" and not the wildcard.)
There are also two other anchors: \b matches a word boundary, while \B matches a non-boundary. Thus /\bthe\b/ matches the word the but not the word other. More technically, Perl defines a word as any sequence of digits, underscores, or letters; this is based on the definition of "words" in programming languages like Perl or C. For example, /\b99\b/ will match the string 99 in There are 99 bottles of beer on the wall (because 99 follows a space) but not 99 in There are 299 bottles of beer on the wall (since 99 follows a number). But it will match 99 in $99 (since 99 follows a dollar sign ($), which is not a digit, underscore, or letter).
2.1.2 Disjunction, Grouping, and Precedence
Suppose we need to search for texts about pets; perhaps we are particularly interested in cats and dogs. In such a case we might want to search for either the string cat or the string dog. Since we can't use the square brackets to search for "cat or dog" (why not?), we need a new operator, the disjunction operator, also called the pipe symbol |.
DISJUNCTION
The pattern /cat|dog/ matches either the string cat or the string dog.
Sometimes we need to use this disjunction operator in the midst of a larger sequence. For example, suppose I want to search for information about pet fish for my cousin David. How can I specify both guppy and guppies? We cannot simply say /guppy|ies/, because that would match only the strings guppy and ies. This is because sequences like guppy take precedence over the disjunction operator |.
PRECEDENCE
In order to make the disjunction operator apply only to a specific pattern, we need to use the parenthesis operators ( and ). Enclosing a pattern in parentheses makes it act like a single character for the purposes of neighboring operators like the pipe | and the Kleene *. So the pattern /gupp(y|ies)/ would specify that we meant the disjunction only to apply to the suffixes y and ies.
The parenthesis operator ( is also useful when we are using counters like the Kleene *. Unlike the | operator, the Kleene * operator applies by default only to a single character, not a whole sequence. Suppose we want to match repeated instances of a string. Perhaps we have a line that has column labels of the form Column 1 Column 2 Column 3. The expression /Column [0-9]+ */ will not match any number of columns; instead, it will match a single column followed by any number of spaces! The star here applies only to the space that precedes it, not the whole sequence. With the parentheses, we could write the expression /(Column [0-9]+ *)*/ to match the word Column, followed by a number and optional spaces, the whole pattern repeated any number of times.
This idea that one operator may take precedence over another, requiring us to sometimes use parentheses to specify what we mean, is formalized by the operator precedence hierarchy for regular expressions. The following table gives the order of RE operator precedence, from highest to lowest:

Parenthesis              ( )
Counters                 * + ? {}
Sequences and anchors    the ˆmy end$
Disjunction              |

Because counters have a higher precedence than sequences, /the*/ matches theeeee but not thethe. Because sequences have a higher precedence than disjunction, /the|any/ matches the or any but not theny.
Patterns can be ambiguous in another way. Consider the expression /[a-z]*/ when matching against the text once upon a time. Since /[a-z]*/ matches zero or more letters, this expression could match nothing, or just the first letter o, or on, or onc, or once. In these cases regular expressions always match the largest string they can;
we say that patterns are greedy, expanding to cover as much of a string as they can.
GREEDY
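Greediness is easy to see by running a pattern in code. The following Python fragment (an added illustration, not from the original text) shows /[a-z]*/ grabbing the longest match it can at the start of the string; the non-greedy variant is included only for contrast:

```python
import re

text = "once upon a time"

# [a-z]* could in principle match the empty string, "o", "on", "onc", or "once";
# a greedy match takes the largest string available at the current position.
print(re.match(r"[a-z]*", text).group())          # once

# Python's re also offers a non-greedy variant *?, which takes the smallest match.
print(repr(re.match(r"[a-z]*?", text).group()))   # ''
```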
2.1.3 A Simple Example
Suppose we wanted to write a RE to find cases of the English article the. A simple (but incorrect) pattern might be:
/the/
One problem is that this pattern will miss the word when it begins a sentence and hence is capitalized (i.e., The). This might lead us to the following pattern:
/[tT]he/
But we will still incorrectly return texts with the embedded in other words (e.g., other or theology). So we need to specify that we want instances with a word boundary on both sides:
/\b[tT]he\b/
Suppose we wanted to do this without the use of /\b/? We might want this since /\b/ won't treat underscores and numbers as word boundaries; but we might want to find the in some context where it might also have underlines or numbers nearby (the_ or the25). We need to specify that we want instances in which there are no alphabetic letters on either side of the the:
/[ˆa-zA-Z][tT]he[ˆa-zA-Z]/
But there is still one more problem with this pattern: it won't find the word the when it begins a line. This is because the regular expression [ˆa-zA-Z], which we used to avoid embedded thes, implies that there must be some single (although non-alphabetic) character before the the. We can avoid this by specifying that before the the we require either the beginning-of-line or a non-alphabetic character, and the same at the end of the line:
/(ˆ|[ˆa-zA-Z])[tT]he([ˆa-zA-Z]|$)/
The process we just went through was based on fixing two kinds of errors: false positives, strings that we incorrectly matched like other or there, and false negatives, strings that we incorrectly missed, like The.
FALSE POSITIVES
FALSE NEGATIVES
Addressing these two kinds of errors comes up again and again in building and improving speech and language processing systems. Reducing the error rate for an application thus involves two antagonistic efforts:
• Increasing accuracy (minimizing false positives)
• Increasing coverage (minimizing false negatives).
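As a rough check of this tradeoff, the following Python sketch (added here as an illustration; the test lines are made up) runs the successive patterns from this section over a few strings and shows how the false positives and false negatives change:

```python
import re

lines = ["The cat sat.", "the dog ran", "in the other room", "see them there"]

patterns = [
    r"the",                                # false negative on "The"; false positives on them/there/other
    r"[tT]he",                             # fixes the capitalization miss, keeps the false positives
    r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)",   # requires non-letters (or line edges) on both sides
]

for pat in patterns:
    hits = [line for line in lines if re.search(pat, line)]
    print(pat, "->", hits)
# The last pattern matches the first three lines, which really do contain the word the/The,
# and rejects "see them there".
```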
2.1.4 A More Complex Example
Let's try out a more significant example of the power of REs. Suppose we want to build an application to help a user buy a computer on the Web. The user might want "any PC with more than 500 MHz and 32 Gb of disk space for less than $1000". In order to do this kind of retrieval we will first need to be able to look for expressions like 500 MHz or 32 Gb or Compaq or Mac or $999.99. In the rest of this section we'll work out some simple regular expressions for this task.
First, let's complete our regular expression for prices. Here's a regular expression for a dollar sign followed by a string of digits. Note that Perl is smart enough to realize that $ here doesn't mean end-of-line; how might it know that?
/$[0-9]+/
Now we just need to deal with fractions of dollars. We'll add a decimal point and two digits afterwards:
/$[0-9]+\.[0-9][0-9]/
This pattern only allows $199.99 but not $199. We need to make the cents optional, and make sure we're at a word boundary:
/\b$[0-9]+(\.[0-9][0-9])?\b/
How about specifications for processor speed (in megahertz = MHz or gigahertz = GHz)? Here's a pattern for that:
/\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/
Note that we use / */ to mean "zero or more spaces", since there might always be extra spaces lying around. Dealing with disk space (in Gb = gigabytes), or memory size (in Mb = megabytes or Gb = gigabytes), we need to allow for optional gigabyte fractions again (5.5 Gb). Note the use of ? for making the final s optional:
/\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
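Here is a small Python check of these patterns (an added illustration, not from the original text; the product description is invented, and the disk-size pattern is the variant just given). One detail: in Python's re module \b only fires next to a letter, digit, or underscore, so the leading \b of the price pattern is dropped before the escaped dollar sign:

```python
import re

ad = "Intel PC with 500 MHz processor, 2 gigahertz bus, 80.5 Gb disk, only $999.99 or $1000"

# \b before "$" would never match in Python, since "$" is not a word character.
price = re.compile(r"\$[0-9]+(\.[0-9][0-9])?\b")
speed = re.compile(r"\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b")
disk  = re.compile(r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b")

print([m.group() for m in price.finditer(ad)])  # ['$999.99', '$1000']
print([m.group() for m in speed.finditer(ad)])  # ['500 MHz', '2 gigahertz']
print([m.group() for m in disk.finditer(ad)])   # ['80.5 Gb']
```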
2.1.5 Advanced Operators

There are also some useful advanced regular expression operators. Fig. 2.6 shows some useful aliases for common ranges, which can be used mainly to save typing.

RE    Expansion        Match                              Example Patterns
\d    [0-9]            any digit
\D    [ˆ0-9]           any non-digit
\w    [a-zA-Z0-9_]     any alphanumeric or underscore     Daiyu
\W    [ˆ\w]            a non-alphanumeric
\s    whitespace       (space, tab)
\S    [ˆ\s]            non-whitespace
Figure 2.6   Aliases for common sets of characters.

Besides the Kleene * and Kleene +, we can also use explicit numbers as counters, by enclosing them in curly brackets. The regular expression /{3}/ means "exactly 3 occurrences of the previous character or expression". So /a\.{24}z/ will match a followed by 24 dots followed by z (but not a followed by 23 or 25 dots followed by a z).
A range of numbers can also be specified; so /{n,m}/ specifies from n to m occurrences of the previous char or expression, while /{n,}/ means at least n occurrences of the previous expression. REs for counting are summarized in Fig. 2.7.
RE       Match
*        zero or more occurrences of the previous char or expression
+        one or more occurrences of the previous char or expression
?        exactly zero or one occurrence of the previous char or expression
{n}      n occurrences of the previous char or expression
{n,m}    from n to m occurrences of the previous char or expression
{n,}     at least n occurrences of the previous char or expression
Figure 2.7   Regular expression operators for counting.
Finally, certain special characters are referred to by special notation based on the backslash (\). The most common of these are the newline character \n and the tab character \t; to refer to characters that are special themselves (like . and ?), we precede them with a backslash. Fig. 2.8 shows some examples.

RE    Match              Example Patterns Matched
\.    a period "."       "Dr. Livingston, I presume"
\?    a question mark    "Why don't they come and lend a hand?"
\n    a newline
\t    a tab
Figure 2.8   Some characters that need to be backslashed.
The reader should consult Appendix A for further details of regular expressions, and especially for the differences between regular expressions in Perl, UNIX, and Microsoft Word.
2.1.6 Regular Expression Substitution, Memory, and ELIZA
SUBSTITUTION
An important use of regular expressions is in substitutions. For example, the Perl substitution operator s/regexp1/pattern/ allows a string characterized by a regular expression to be replaced by another string:
s/colour/color/
It is often useful to be able to refer to a particular subpart of the string matching the first pattern. For example, suppose we wanted to put angle brackets around all integers in a text, changing e.g., the 35 boxes to the <35> boxes. We'd like a way to refer back to the integer we've found so that we can easily add the brackets. To do this, we put parentheses ( and ) around the first pattern, and use the number operator \1 in the second pattern to refer back. Here's how it looks:
s/([0-9]+)/<\1>/
The parenthesis and number operators can also be used to specify that a certain string or expression must occur twice in the text. For example, suppose we are looking for the pattern "the Xer they were, the Xer they will be", where we want to constrain the two X's to be the same string. We do this by surrounding the first X with the parenthesis operator, and replacing the second X with the number operator \1, as follows:
/the (.*)er they were, the \1er they will be/
Here the \1 will be replaced by whatever string matched the first item in parentheses. So this will match The bigger they were, the bigger they will be but not The bigger they were, the faster they will be.
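Capture groups and back-references work the same way in most regex engines. A small Python illustration (added here, not from the original text; the example strings come from the text, lowercased so the lowercase pattern applies):

```python
import re

# \1 refers back to whatever the first parenthesized group matched.
pattern = re.compile(r"the (.*)er they were, the \1er they will be")

print(bool(pattern.search("The bigger they were, the bigger they will be".lower())))  # True
print(bool(pattern.search("The bigger they were, the faster they will be".lower())))  # False

# In a substitution, \1 copies the matched group into the replacement.
print(re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes"))   # the <35> boxes
```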
The number operator can be used with other numbers: if you match two different sets of parentheses, \2 means whatever matched the second set. For example,
/the (.*)er they (.*), the \1er they \2/
will match The bigger they were, the bigger they were but not The bigger they were, the bigger they will be. These numbered memories are called registers (e.g. register 1,
REGISTERS
register 2, register 3, etc.). This memory feature is not part of every regular expression language and is often considered an "extended" feature of regular expressions.
Substitutions using memory are very useful in implementing a simple natural-language understanding program like ELIZA (Weizenbaum, 1966). Recall that ELIZA simulated a Rogerian psychologist and could carry on conversations with the user like the following:
User1: Men are all alike.
ELIZA1: IN WHAT WAY
User2: They're always bugging us about something or other.
ELIZA2: CAN YOU THINK OF A SPECIFIC EXAMPLE
User3: Well, my boyfriend made me come here.
ELIZA3: YOUR BOYFRIEND MADE YOU COME HERE
User4: He says I'm depressed much of the time.
ELIZA4: I AM SORRY TO HEAR YOU ARE DEPRESSED
Eliza worked by having a cascade of regular expression substitutions that each matched some part of the input lines and changed them. The first substitutions changed all instances of my to YOUR, and I'm to YOU ARE, and so on. The next set of substitutions looked for relevant patterns in the input and created an appropriate output; here are some examples:
s/.* YOU ARE (depressed|sad) */I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) */WHY DO YOU THINK YOU ARE \1/
s/.* all */IN WHAT WAY/
s/.* always */CAN YOU THINK OF A SPECIFIC EXAMPLE/
Since multiple substitutions could apply to a given input, substitutions were assigned a rank and were applied in order. Creation of such patterns is addressed in Exercise 2.2.
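A cascade of this kind is only a few lines of code. The sketch below is an illustrative Python reconstruction added here (it is neither Weizenbaum's program nor the book's code); it uppercases the input, applies a tiny subset of viewpoint swaps, then tries ranked response patterns in order, with a made-up default reply when nothing matches:

```python
import re

# Stage 1: simple viewpoint swaps (a small illustrative subset).
SWAPS = [
    (r"\bI'?m\b", "YOU ARE"),
    (r"\bmy\b", "YOUR"),
    (r"\bme\b", "YOU"),
]

# Stage 2: ranked response patterns; the first one that matches produces the reply.
RESPONSES = [
    (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* all .*", "IN WHAT WAY"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def eliza_reply(line: str) -> str:
    text = line.upper()
    for pat, repl in SWAPS:
        text = re.sub(pat, repl, text, flags=re.IGNORECASE)
    for pat, repl in RESPONSES:
        if re.match(pat, text, flags=re.IGNORECASE):
            return re.sub(pat, repl, text, flags=re.IGNORECASE)
    return "PLEASE GO ON"   # fallback, invented for this sketch

print(eliza_reply("He says I'm depressed much of the time."))  # I AM SORRY TO HEAR YOU ARE DEPRESSED
print(eliza_reply("Men are all alike."))                        # IN WHAT WAY
```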
2.2 Finite-State Automata

The regular expression is more than just a convenient metalanguage for text searching. First, a regular expression is one way of describing a finite-state automaton (FSA).
FINITE-STATE AUTOMATON
FSA
Finite-state automata are the theoretical foundation of a good deal of the computational work we will describe in this book. Any regular expression can be implemented as a finite-state automaton (except regular expressions that use the memory feature; more on this later). Symmetrically, any finite-state automaton can be described with a regular expression. Second, a regular expression is one way of characterizing a particular kind of formal language called a regular language. Both regular expressions and
REGULAR LANGUAGE
finite-state automata can be used to describe regular languages. A third equivalent method of characterizing the regular languages, the regular grammar, will be introduced in Ch. 15. The relation among these four theoretical constructions is sketched out in Fig. 2.9.
Figure 2.9   Finite automata, regular expressions, and regular grammars are all equivalent ways of describing regular languages.
This section will begin by introducing finite-state automata for some of the regular expressions from the last section, and then suggest how the mapping from regular expressions to automata proceeds in general. Although we begin with their use for implementing regular expressions, FSAs have a wide variety of other uses that we will explore in this chapter and the next.
regu-2.2.1 Using an FSA to Recognize Sheeptalk
After a while, with the parrot’s help, the Doctor got to learn the language of the animals so well that he could talk to them himself and understand everything they said.
Hugh Lofting, The Story of Doctor Dolittle
Let's begin with the "sheep language" we discussed previously. Recall that we defined the sheep language as any string from the following (infinite) set:

baa!
baaa!
baaaa!
baaaaa!
...

Figure 2.10   A finite-state automaton for talking sheep.
The regular expression for this kind of "sheeptalk" is /baa+!/. Fig. 2.10 shows an automaton for modeling this regular expression. The automaton (i.e., machine,
AUTOMATON
also called finite automaton, finite-state automaton, or FSA) recognizes a set of strings, in this case the strings characterizing sheep talk, in the same way that a regular expression does. We represent the automaton as a directed graph: a finite set of vertices (also called nodes), together with a set of directed links between pairs of vertices called arcs. We'll represent vertices with circles and arcs with arrows. The automaton has five
STATES
START STATE
states, which are represented by nodes in the graph. State 0 is the start state. In our examples state 0 will generally be the start state; to mark another state as the start state we can add an incoming arrow to the start state. State 4 is the final state or accepting state, which we represent by the double circle. It also has four transitions, which we represent by arcs in the graph.
The FSA can be used for recognizing (we also say accepting) strings in the following way. First, think of the input as being written on a long tape broken up into cells, with one symbol written in each cell of the tape, as in Fig. 2.11.
Figure 2.11   A tape with cells.
The machine starts in the start state (q0), and iterates the following process: Check the next letter of the input. If it matches the symbol on an arc leaving the current state, then cross that arc, move to the next state, and also advance one symbol in the input. If we are in the accepting state (q4) when we run out of input, the machine has successfully recognized an instance of sheeptalk. If the machine never gets to the final state, either because it runs out of input, or it gets some input that doesn't match an arc (as in Fig. 2.11), or if it just happens to get stuck in some non-final state, we say the machine rejects or fails to accept an input.
         Input
State    b   a   !
0        1   ∅   ∅
1        ∅   2   ∅
2        ∅   3   ∅
3        ∅   3   4
4:       ∅   ∅   ∅
Figure 2.12   The state-transition table for the FSA of Figure 2.10.
We've marked state 4 with a colon to indicate that it's a final state (you can have as many final states as you want), and the ∅ indicates an illegal or missing transition. We can read the first row as "if we're in state 0 and we see the input b we must go to state 1. If we're in state 0 and we see the input a or !, we fail".
More formally, a finite automaton is defined by the following five parameters:

Q = q0, q1, ..., qN−1    a finite set of N states
Σ                        a finite input alphabet of symbols
q0                       the start state
F                        the set of final states, F ⊆ Q
δ(q, i)                  the transition function or transition matrix between states. Given a state q ∈ Q and an input symbol i ∈ Σ, δ(q, i) returns a new state q′ ∈ Q. δ is thus a relation from Q × Σ to Q.

For the sheeptalk automaton in Fig. 2.10, Q = {q0, q1, q2, q3, q4}, Σ = {a, b, !}, F = {q4}, and δ(q, i) is defined by the transition table in Fig. 2.12.
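The five parameters of the formal definition translate directly into data. Here is a minimal Python sketch of the sheeptalk automaton (an illustration added here; the variable names are my own, not the book's):

```python
# The sheeptalk FSA /baa+!/ as explicit data structures.
Q     = {0, 1, 2, 3, 4}          # states
SIGMA = {"a", "b", "!"}          # input alphabet
START = 0                        # start state q0
FINAL = {4}                      # accepting states F
DELTA = {                        # transition function delta(q, i)
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,                 # the self-loop that allows baaa!, baaaa!, ...
    (3, "!"): 4,
}

print(DELTA.get((0, "b")))       # 1; a missing key plays the role of an empty cell
```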
Figure 2.13 presents an algorithm for recognizing a string using a state-transition table. The algorithm is called D-RECOGNIZE for "deterministic recognizer". A deterministic algorithm is one that has no choice points; the algorithm always knows what
DETERMINISTIC
to do for any input. The next section will introduce non-deterministic automata that must make decisions about which states to move to.
D-RECOGNIZE takes as input a tape and an automaton. It returns accept if the string it is pointing to on the tape is accepted by the automaton, and reject otherwise. Note that since D-RECOGNIZE assumes it is already pointing at the string to be checked, its task is only a subpart of the general problem that we often use regular expressions for,
finding a string in a corpus. (The general problem is left as an exercise to the reader in Exercise 2.9.)
D-RECOGNIZE begins by setting the variable index to the beginning of the tape, and current-state to the machine's initial state. D-RECOGNIZE then enters a loop that drives the rest of the algorithm. It first checks whether it has reached the end of its input. If so, it either accepts the input (if the current state is an accept state) or rejects the input (if not).
If there is input left on the tape, D-RECOGNIZE looks at the transition table to decide which state to move to. The variable current-state indicates which row of the table to consult, while the current symbol on the tape indicates which column of the table to consult. The resulting transition-table cell is used to update the variable current-state and index is incremented to move forward on the tape. If the transition-table cell is empty then the machine has nowhere to go and must reject the input.
function D-RECOGNIZE(tape, machine) returns accept or reject

  index ← Beginning of tape
  current-state ← Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then return accept
      else return reject
    elsif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state ← transition-table[current-state, tape[index]]
      index ← index + 1
  end

Figure 2.13   An algorithm for deterministic recognition of FSAs. This algorithm returns accept if the entire string it is pointing at is in the language defined by the FSA, and reject if the string is not in the language.
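The pseudocode of Fig. 2.13 translates almost line by line into Python. The following sketch is an added illustration (not the book's code), using a dictionary as the transition table, with absent keys playing the role of the empty ∅ cells:

```python
def d_recognize(tape, delta, start, final):
    """Deterministic FSA recognition: return True iff the whole tape is accepted."""
    index = 0
    current_state = start
    while True:
        if index == len(tape):                        # end of input reached
            return current_state in final
        if (current_state, tape[index]) not in delta:
            return False                              # empty transition-table cell
        current_state = delta[(current_state, tape[index])]
        index += 1

# Sheeptalk again: /baa+!/
DELTA = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}
print(d_recognize("baaa!", DELTA, 0, {4}))   # True
print(d_recognize("abc",   DELTA, 0, {4}))   # False
```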
Figure 2.14 traces the execution of this algorithm on the sheep language FSA given
the sample input string baaa!.
Figure 2.14   Tracing the execution of FSA #1 on some sheeptalk; the machine moves through the states q0 q1 q2 q3 q3 q4.
Before examining the beginning of the tape, the machine is in state q0. Finding a b on the input tape, it changes to state q1 as indicated by the contents of transition-table[q0,b] in Fig. 2.12. It then finds an a and switches to state q2, another a puts it in state q3, a third a leaves it in state q3, where it reads the "!", and switches to state q4. Since there is no more input, the End of input condition at the beginning of the loop is satisfied for the first time and the machine halts in q4. State q4 is an accepting state, and so the machine has accepted the string baaa! as a sentence in the sheep language.
The algorithm will fail whenever there is no legal transition for a given combination of state and input. The input abc will fail to be recognized since there is no legal transition out of state q0 on the input a (i.e., this entry of the transition table in Fig. 2.12 has a ∅). Even if the automaton had allowed an initial a it would have certainly failed on c, since c isn't even in the sheeptalk alphabet! We can think of these "empty" elements in the table as if they all pointed at one "empty" state, which we might call the fail state or sink state. In a sense then, we could view any machine with
FAIL STATE
empty transitions as if we had augmented it with a fail state, and drawn in all the extra arcs, so we always had somewhere to go from any state on any possible input. Just for completeness, Fig. 2.15 shows the FSA from Figure 2.10 with the fail state qF filled in.
Figure 2.15   Adding a fail state to Fig. 2.10.
2.2.2 Formal Languages
We can use the same graph in Fig. 2.10 as an automaton for GENERATING sheeptalk. If we do, we would say that the automaton starts at state q0, and crosses arcs to new states, printing out the symbols that label each arc it follows. When the automaton gets to the final state it stops. Notice that at state 3, the automaton has to choose between printing out a ! and going to state 4, or printing out an a and returning to state 3. Let's say for now that we don't care how the machine makes this decision; maybe it flips a coin. For now, we don't care which exact string of sheeptalk we generate, as long as it's a string captured by the regular expression for sheeptalk above.

Formal Language: A model which can both generate and recognize all and only the strings of a formal language acts as a definition of the formal language.
A formal language is a set of strings, each string composed of symbols from a
FORMAL LANGUAGE
finite symbol-set called an alphabet (the same alphabet used above for defining an
ALPHABET
automaton!). The alphabet for the sheep language is the set Σ = {a, b, !}. Given a model m (such as a particular FSA), we can use L(m) to mean "the formal language characterized by m". So the formal language defined by our sheeptalk automaton m in Fig. 2.10 (and Fig. 2.12) is the infinite set:
L(m) = {baa!, baaa!, baaaa!, baaaaa!, baaaaaa!, ...}
Formal languages are not the same thing as natural languages, the languages that people actually speak; but we often use a formal language to model part of a natural language, such as parts of the phonology, morphology, or syntax. The term generative grammar is sometimes used in linguistics to mean a grammar of a formal language; the origin of the term is this use of an automaton to define a language by generating all possible strings.
2.2.3 Another Example
In the previous examples our formal alphabet consisted of letters; but we can also have a higher-level alphabet consisting of words. In this way we can write finite-state automata that model facts about word combinations. For example, suppose we wanted to build an FSA that modeled the subpart of English dealing with amounts of money. Such a formal language would model the subset of English consisting of phrases like ten cents, three dollars, one dollar thirty-five cents and so on.
We might break this down by first building just the automaton to account for the numbers from 1 to 99, since we'll need them to deal with cents. Fig. 2.16 shows this.
Figure 2.16   An FSA for the words for English numbers 1–99.
We could now add cents and dollars to our automaton. Fig. 2.17 shows a simple version of this, where we just made two copies of the automaton in Fig. 2.16 and appended the words cents and dollars.
Figure 2.17   FSA for the simple dollars and cents.
We would now need to add in the grammar for different amounts of dollars, including higher numbers like hundred, thousand. We'd also need to make sure that the nouns like cents and dollars are singular when appropriate (one cent, one dollar), and plural when appropriate (ten cents, two dollars). This is left as an exercise for the reader (Exercise 2.3). We can think of the FSAs in Fig. 2.16 and Fig. 2.17 as simple grammars of parts of English. We will return to grammar-building in Part II of this book, particularly in Ch. 12.
2.2.4 Non-Deterministic FSAs

Let's extend our discussion now to another class of FSAs: non-deterministic FSAs (or NFSAs). Consider the sheeptalk automaton in Figure 2.18, which is much like our first automaton in Figure 2.10; the only difference is that in the NFSA of Fig. 2.18 (NFSA #1) the self-loop on a is on state 2 instead of state 3. Consider using this network as an automaton for recognizing sheeptalk. When we get to state 2, if we see an a we don't know whether to remain in state 2 or go on to state 3. Automata with decision points like this are called non-deterministic FSAs (or NFSAs). Recall by contrast
NON-DETERMINISTIC
NFSA
that Figure 2.10 specified a deterministic automaton, i.e., one whose behavior during recognition is fully determined by the state it is in and the symbol it is looking at. A deterministic automaton can be referred to as a DFSA. That is not true for the machine in Fig. 2.18.
Another kind of non-determinism is caused by arcs that have no symbols on them, called ε-transitions. The automaton in Fig. 2.19 defines the exact same language as the last one, or our first one, but it does it with an ε-transition.
Figure 2.19   Another NFSA for the sheep language (NFSA #2). It differs from NFSA #1 in Fig. 2.18 in having an ε-transition.
We interpret this new arc as follows: If we are in state 3, we are allowed to move to state 2 without looking at the input, or advancing our input pointer. So this introduces another kind of non-determinism: we might not know whether to follow the ε-transition or the ! arc.
2.2.5 Using an NFSA to Accept Strings
If we want to know whether a string is an instance of sheeptalk or not, and if we use a non-deterministic machine to recognize it, we might follow the wrong arc and reject it when we should have accepted it. That is, since there is more than one choice at some point, we might take the wrong choice. This problem of choice in non-deterministic models will come up again and again as we build computational models, particularly for parsing. There are three standard solutions to the problem of non-determinism:

• Backup: Whenever we come to a choice point, we could put a marker to mark where we are in the input and what state the automaton is in. Then if it turns out that we took the wrong choice, we could back up and try another path.
• Look-ahead: We could look ahead in the input to help us decide which path to take.
• Parallelism: Whenever we come to a choice point, we could look at every alternative path in parallel.

We will focus here on the backup approach and defer discussion of the look-ahead and parallelism approaches to later chapters.
The backup approach suggests that we should blithely make choices that might lead to dead ends, knowing that we can always return to unexplored alternative choices. There are two keys to this approach: we need to remember all the alternatives for each choice point, and we need to store sufficient information about each alternative so that we can return to it when necessary. When a backup algorithm reaches a point in its processing where no progress can be made (because it runs out of input, or has no legal transitions), it returns to a previous choice point, selects one of the unexplored alternatives, and continues from there. Applying this notion to our non-deterministic recognizer, we need only remember two things for each choice point: the state, or node, of the machine that we can go to and the corresponding position on the tape. We will call the combination of the node and position the search-state of the recognition algorithm. To avoid confusion, we will refer to the state of the automaton (as opposed to the state of the search) as a node or a machine-state. Figure 2.21 presents a recognition algorithm based on this approach.
Before going on to describe the main part of this algorithm, we should note two changes to the transition table that drives it. First, in order to represent nodes that have outgoing ε-transitions, we add a new ε-column to the transition table. If a node has an ε-transition, we list the destination node in the ε-column for that node's row. The second addition is needed to account for multiple transitions to different nodes from the same input symbol. We let each cell entry consist of a list of destination nodes rather than a single node. Fig. 2.20 shows the transition table for the machine in Figure 2.18 (NFSA #1). While it has no ε-transitions, it does show that in machine-state q2 the input a can lead back to q2 or on to q3.

         Input
State    b    a      !    ε
0        1    ∅      ∅    ∅
1        ∅    2      ∅    ∅
2        ∅    2,3    ∅    ∅
3        ∅    ∅      4    ∅
4:       ∅    ∅      ∅    ∅
Figure 2.20   The transition table from NFSA #1 in Fig. 2.18.
Fig. 2.21 shows the algorithm for using a non-deterministic FSA to recognize an input string. The function ND-RECOGNIZE uses the variable agenda to keep track of all the currently unexplored choices generated during the course of processing. Each choice (search state) is a tuple consisting of a node (state) of the machine and a position on the tape. The variable current-search-state represents the branch choice being currently explored.
ND-RECOGNIZE begins by creating an initial search-state and placing it on the agenda. For now we don't specify what order the search-states are placed on the agenda. This search-state consists of the initial machine-state of the machine and a pointer to the beginning of the tape. The function NEXT is then called to retrieve an item from the agenda and assign it to the variable current-search-state.
As with D-RECOGNIZE, the first task of the main loop is to determine if the entire contents of the tape have been successfully recognized. This is done via a call to ACCEPT-STATE?, which returns accept if the current search-state contains both an accepting machine-state and a pointer to the end of the tape. If we're not done, the machine generates a set of possible next steps by calling GENERATE-NEW-STATES, which creates search-states for any ε-transitions and any normal input-symbol transitions from the transition table. All of these search-state tuples are then added to the current agenda.
Finally, we attempt to get a new search-state to process from the agenda. If the agenda is empty we've run out of options and have to reject the input. Otherwise, an unexplored option is selected and the loop continues.
It is important to understand why ND-RECOGNIZE returns a value of reject only when the agenda is found to be empty. Unlike D-RECOGNIZE, it does not return reject when it reaches the end of the tape in a non-accept machine-state or when it finds itself unable to advance the tape from some machine-state. This is because, in the non-deterministic case, such roadblocks only indicate failure down a given path, not overall failure. We can only be sure we can reject a string when all possible choices have been examined and found lacking.
function ND-RECOGNIZE(tape, machine) returns accept or reject

  agenda ← {(Initial state of machine, beginning of tape)}
  current-search-state ← NEXT(agenda)
  loop
    if ACCEPT-STATE?(current-search-state) returns true then
      return accept
    else
      agenda ← agenda ∪ GENERATE-NEW-STATES(current-search-state)
    if agenda is empty then
      return reject
    else
      current-search-state ← NEXT(agenda)
  end

function GENERATE-NEW-STATES(current-state) returns a set of search-states

  current-node ← the node the current search-state is in
  index ← the point on the tape the current search-state is looking at
  return a list of search states from transition table as follows:
    (transition-table[current-node, ε], index)
    ∪
    (transition-table[current-node, tape[index]], index + 1)

function ACCEPT-STATE?(search-state) returns true or false

  current-node ← the node search-state is in
  index ← the point on the tape search-state is looking at
  if index is at the end of the tape and current-node is an accept state of machine
    then return true
    else return false

Figure 2.21   An algorithm for NFSA recognition. The word node means a state of the FSA, while state or search-state means "the state of the search process", i.e., a combination of node and tape-position.
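The same agenda-driven search can be sketched in Python (an added illustration, not the book's code). Here the transition table maps a (node, symbol) pair to a list of destination nodes, the empty string stands in for ε, and the agenda is a simple stack of (node, tape-position) search-states; ε-cycles are assumed absent:

```python
def nd_recognize(tape, delta, start, final):
    """Non-deterministic FSA recognition with backup (depth-first agenda)."""
    agenda = [(start, 0)]                    # search-states: (machine node, tape position)
    while agenda:
        node, index = agenda.pop()           # NEXT: take an unexplored search-state
        if index == len(tape) and node in final:
            return True                      # ACCEPT-STATE?
        # GENERATE-NEW-STATES: epsilon moves plus moves on the current input symbol.
        for nxt in delta.get((node, ""), []):
            agenda.append((nxt, index))
        if index < len(tape):
            for nxt in delta.get((node, tape[index]), []):
                agenda.append((nxt, index + 1))
    return False                             # agenda exhausted: reject

# NFSA #1: the self-loop on a is on state 2 rather than state 3.
DELTA = {(0, "b"): [1], (1, "a"): [2], (2, "a"): [2, 3], (3, "!"): [4]}
print(nd_recognize("baaa!", DELTA, 0, {4}))   # True
print(nd_recognize("baa",   DELTA, 0, {4}))   # False
```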
Figure 2.22 illustrates the progress of ND-RECOGNIZE as it attempts to handle the input baaa!. Each strip illustrates the state of the algorithm at a given point in its processing. The current-search-state variable is captured by the solid bubbles representing the machine-state, along with the arrow representing progress on the tape. Each strip lower down in the figure represents progress from one current-search-state to the next.
Little of interest happens until the algorithm finds itself in state q2 while looking at the second a on the tape. An examination of the entry for transition-table[q2,a] returns both q2 and q3. Search states are created for each of these choices and placed on the agenda. Unfortunately, our algorithm chooses to move to state q3, a move that results in neither an accept state nor any new states since the entry for transition-table[q3,a] is empty. At this point, the algorithm simply asks the agenda for a new state to pursue. Since the choice of returning to q2 from q2 is the only unexamined choice on the agenda, it is returned with the tape pointer advanced to the next a. Somewhat diabolically, ND-RECOGNIZE finds itself faced with the same choice. The entry for transition-table[q2,a] still indicates that looping back to q2 or advancing to q3 are valid choices. As before, states representing both are placed on the agenda. These search states are not the same as the previous ones since their tape index values have advanced. This time the agenda provides the move to q3 as the next move. The move to q4, and success, is then uniquely determined by the tape and the transition-table.