Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition
Daniel Jurafsky & James H. Martin
Copyright © 2006, All rights reserved. Draft of June 25, 2007. Do not cite without permission.

1 Introduction
Dave Bowman: Open the pod bay doors, HAL.
HAL: I'm sorry Dave, I'm afraid I can't do that.
Stanley Kubrick and Arthur C. Clarke, screenplay of 2001: A Space Odyssey
This book is about a new interdisciplinary field variously called computer speech and language processing or human language technology or natural language processing or computational linguistics. The goal of this new field is to get computers to perform useful tasks involving human language, tasks like enabling human-machine communication, improving human-human communication, or simply doing useful processing of text or speech.
One example of such a useful task is a conversational agent. The HAL 9000 computer in Stanley Kubrick's film 2001: A Space Odyssey is one of the most recognizable characters in twentieth-century cinema. HAL is an artificial agent capable of such advanced language-processing behavior as speaking and understanding English, and at a crucial moment in the plot, even reading lips. It is now clear that HAL's creator, Arthur C. Clarke, was a little optimistic in predicting when an artificial agent such as HAL would be available. But just how far off was he? What would it take to create at least the language-related parts of HAL? We call programs like HAL that converse with humans via natural language conversational agents or dialogue systems. In this text we study the various components that make up modern conversational agents, including language input (automatic speech recognition and natural language understanding) and language output (natural language generation and speech synthesis).
Let's turn to another useful language-related task, that of making available to non-English-speaking readers the vast amount of scientific information on the Web in English, or translating for English speakers the hundreds of millions of Web pages written in other languages like Chinese. The goal of machine translation is to automatically translate a document from one language to another. Machine translation is far from a solved problem; we will cover the algorithms currently used in the field, as well as important component tasks.
Many other language processing tasks are also related to the Web. Another such task is Web-based question answering. This is a generalization of simple Web search, where instead of just typing keywords a user might ask complete questions, ranging from easy to hard, like the following:
• What does “divergent” mean?
• What year was Abraham Lincoln born?
• How many states were in the United States that year?
• How much Chinese silk was exported to England by the end of the 18th century?
• What do scientists think about the ethics of human cloning?
Some of these, such as definition questions, or simple factoid questions like dates and locations, can already be answered by search engines. But answering more complicated questions might require extracting information that is embedded in other text on a Web page, or doing inference (drawing conclusions based on known facts), or synthesizing and summarizing information from multiple sources or Web pages. In this text we study the various components that make up modern understanding systems of this kind, including information extraction, word sense disambiguation, and so on.
Although the subfields and problems we've described above are all very far from completely solved, these are all very active research areas and many technologies are already available commercially. In the rest of this chapter we briefly summarize the kinds of knowledge that are necessary for these tasks (and others like spelling correction, grammar checking, and so on), as well as the mathematical models that will be introduced throughout the book.
1.1 Knowledge in Speech and Language Processing

What distinguishes language processing applications from other data processing systems is their use of knowledge of language. Consider the Unix wc program, which is used to count the total number of bytes, words, and lines in a text file. When used to count bytes and lines, wc is an ordinary data processing application. However, when it is used to count the words in a file, it requires knowledge about what it means to be a word, and thus becomes a language processing system.
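The point is easy to see in code. The following is a minimal sketch in Python (not the actual wc implementation): two equally plausible definitions of "word" give two different counts for the same string.

```python
# A minimal sketch: even "counting words" forces a decision about what a word is.
import re

text = "HAL, I'm afraid I can't do that -- open the pod-bay doors!"

# Split on whitespace, as wc does: punctuation sticks to the words.
whitespace_tokens = text.split()

# Keep only runs of word characters: contractions like "can't" break apart.
regex_tokens = re.findall(r"\w+", text)

print(len(whitespace_tokens), whitespace_tokens)  # 12 tokens
print(len(regex_tokens), regex_tokens)            # 14 tokens
```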
Of course, wc is an extremely simple system with an extremely limited and impoverished knowledge of language. Sophisticated conversational agents like HAL, or machine translation systems, or robust question-answering systems, require much broader and deeper knowledge of language. To get a feeling for the scope and kind of required knowledge, consider some of what HAL would need to know to engage in the dialogue that begins this chapter, or for a question answering system to answer one of the questions above.
HAL must be able to recognize words from an audio signal and to generate an audio signal from a sequence of words. These tasks of speech recognition and speech synthesis require knowledge about phonetics and phonology: how words are pronounced in terms of sequences of sounds, and how each of these sounds is realized acoustically.
Note also that unlike Star Trek's Commander Data, HAL is capable of producing contractions like I'm and can't. Producing and recognizing these and other variations of individual words (e.g., recognizing that doors is plural) requires knowledge about morphology, the way words break down into component parts that carry meanings like singular versus plural.
Moving beyond individual words, HAL must use structural knowledge to properly string together the words that constitute its response. For example, HAL must know that the following sequence of words will not make sense to Dave, despite the fact that it contains precisely the same set of words as the original:

I'm I do, sorry that afraid Dave I'm can't

The knowledge needed to order and group words together comes under the heading of syntax.
Now consider a question answering system dealing with the following question:

• How much Chinese silk was exported to Western Europe by the end of the 18th century?

In order to answer this question we need to know something about lexical semantics, the meaning of all the words (export or silk), as well as compositional semantics (what exactly constitutes Western Europe as opposed to Eastern or Southern Europe; what does end mean when combined with the 18th century?). We also need to know something about the relationship of the words to the syntactic structure. For example, we need to know that by the end of the 18th century is a temporal end-point, and not a description of the agent, as the by-phrase is in the following sentence:

• How much Chinese silk was exported to Western Europe by southern merchants?
We also need the kind of knowledge that lets HAL determine that Dave's utterance is a request for action, as opposed to a simple statement about the world or a question about the door, as in the following variations of his original statement:

STATEMENT: HAL, the pod bay door is open.
INFORMATION QUESTION: HAL, is the pod bay door open?

Next, despite its bad behavior, HAL knows enough to be polite to Dave. It could, for example, have simply replied No or No, I won't open the door. Instead, it first embellishes its response with the phrases I'm sorry and I'm afraid, and then only indirectly signals its refusal by saying I can't, rather than the more direct (and truthful) I won't.1 This knowledge about the kind of actions that speakers intend by their use of sentences is pragmatic or dialogue knowledge.
Another kind of pragmatic or discourse knowledge is required to answer the question

• How many states were in the United States that year?

What year is that year? In order to interpret words like that year, a question answering system needs to examine the earlier questions that were asked; in this case the previous question talked about the year that Lincoln was born. Thus this task of coreference resolution makes use of knowledge about how words like that or pronouns like it or she refer to previous parts of the discourse.
To summarize, engaging in complex language behavior requires various kinds of knowledge of language:
1 For those unfamiliar with HAL, it is neither sorry nor afraid, nor is it incapable of opening the door. It has simply decided in a fit of paranoia to kill its crew.
• Phonetics and Phonology — knowledge about linguistic sounds
• Morphology — knowledge of the meaningful components of words
• Syntax — knowledge of the structural relationships between words
• Semantics — knowledge of meaning
• Pragmatics — knowledge of the relationship of meaning to the goals and intentions of the speaker
• Discourse — knowledge about linguistic units larger than a single utterance
1.2 Ambiguity

A perhaps surprising fact about these categories of linguistic knowledge is that most tasks in speech and language processing can be viewed as resolving ambiguity at one of these levels. We say some input is ambiguous if there are multiple alternative linguistic structures that can be built for it. Consider the spoken sentence I made her duck. Here are five different meanings this sentence could have (see if you can think of some more), each of which exemplifies an ambiguity at some level:
(1.1) I cooked waterfowl for her
(1.2) I cooked waterfowl belonging to her
(1.3) I created the (plaster?) duck she owns
(1.4) I caused her to quickly lower her head or body
(1.5) I waved my magic wand and turned her into undifferentiated waterfowl
These different meanings are caused by a number of ambiguities. First, the words duck and her are morphologically or syntactically ambiguous in their part-of-speech: duck can be a verb or a noun, while her can be a dative pronoun or a possessive pronoun. Second, the word make is semantically ambiguous; it can mean create or cook. Third, the verb make is syntactically ambiguous in a different way: make can be transitive, that is, taking a single direct object (1.2), or it can be ditransitive, that is, taking two objects (1.5), meaning that the first object (her) got made into the second object (duck). Finally, make can take a direct object and a verb (1.4), meaning that the object (her) got caused to perform the verbal action (duck). Furthermore, in a spoken sentence there is an even deeper kind of ambiguity; the first word could have been eye or the second word maid.
We will often introduce the models and algorithms we present throughout the book as ways to resolve or disambiguate these ambiguities. For example, deciding whether duck is a verb or a noun can be solved by part-of-speech tagging. Deciding whether make means "create" or "cook" can be solved by word sense disambiguation. Resolution of part-of-speech and word sense ambiguities are two important kinds of lexical disambiguation. A wide variety of tasks can be framed as lexical disambiguation problems. For example, a text-to-speech synthesis system reading the word lead needs to decide whether it should be pronounced as in lead pipe or as in lead me on. By contrast, deciding whether her and duck are part of the same entity (as in (1.1) or (1.4)) or are different entities (as in (1.2)) is an example of syntactic disambiguation and can
be addressed by probabilistic parsing. Ambiguities that don't arise in this particular example (like whether a given sentence is a statement or a question) will also be resolved, for example by speech act interpretation.
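As a small, hedged illustration of lexical disambiguation in practice, the sketch below runs an off-the-shelf part-of-speech tagger from the NLTK toolkit (mentioned at the end of this chapter) on the ambiguous sentence. The exact tags printed depend on the particular tagger model installed; the point is only that a tagger must commit to a single reading.

```python
# Assumes NLTK and its default models are installed:
#   pip install nltk
#   nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
import nltk

tokens = nltk.word_tokenize("I made her duck")
print(nltk.pos_tag(tokens))
# A statistical tagger must choose one part-of-speech per word; for example,
# it may tag "duck" as a noun even though the verb reading is also possible.
```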
1.3 Models and Algorithms

One of the key insights of the last 50 years of research in language processing is that the various kinds of knowledge described in the last sections can be captured through the use of a small number of formal models, or theories. Fortunately, these models and theories are all drawn from the standard toolkits of computer science, mathematics, and linguistics and should be generally familiar to those trained in those fields. Among the most important models are state machines, rule systems, logic, probabilistic models, and vector-space models. These models, in turn, lend themselves to a small number of algorithms, among the most important of which are state space search algorithms such as dynamic programming, and machine learning algorithms such as classifiers and EM and other learning algorithms.
In their simplest formulation, state machines are formal models that consist of states, transitions among states, and an input representation. Some of the variations of this basic model that we will consider are deterministic and non-deterministic finite-state automata and finite-state transducers.
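To make the idea concrete, here is a minimal sketch of a deterministic finite-state automaton written as a transition table. The language it accepts (baa!, baaa!, baaaa!, ...) is just an illustrative toy, not anything defined in this chapter.

```python
# A tiny deterministic finite-state automaton as a transition table.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # self-loop: any number of additional a's
    (3, "!"): 4,
}
ACCEPTING = {4}

def accepts(string: str) -> bool:
    state = 0
    for ch in string:
        state = TRANSITIONS.get((state, ch))
        if state is None:          # no transition defined: reject
            return False
    return state in ACCEPTING

print(accepts("baaaa!"))  # True
print(accepts("ba!"))     # False
```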
Closely related to these models are their declarative counterparts: formal rule systems. Among the more important ones we will consider are regular grammars and regular relations, context-free grammars, and feature-augmented grammars, as well as probabilistic variants of them all. State machines and formal rule systems are the main tools used when dealing with knowledge of phonology, morphology, and syntax.

The third model that plays a critical role in capturing knowledge of language is logic. We will discuss first-order logic, also known as the predicate calculus, as well as such related formalisms as the lambda-calculus, feature structures, and semantic primitives. These logical representations have traditionally been used for modeling semantics and pragmatics, although more recent work has focused on more robust techniques drawn from non-logical lexical semantics.
Probabilistic models are crucial for capturing every kind of linguistic knowledge. Each of the other models (state machines, formal rule systems, and logic) can be augmented with probabilities. For example, the state machine can be augmented with probabilities to become the weighted automaton or Markov model. We will spend a significant amount of time on hidden Markov models or HMMs, which are used everywhere in the field: in part-of-speech tagging, speech recognition, dialogue understanding, text-to-speech, and machine translation. The key advantage of probabilistic models is their ability to solve the many kinds of ambiguity problems that we discussed earlier; almost any speech and language processing problem can be recast as: "given N choices for some ambiguous input, choose the most probable one".
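A sketch of that recipe, with probabilities invented purely for illustration (no real model is being consulted), looks like this:

```python
# "Given N choices for some ambiguous input, choose the most probable one."
# The probabilities below are made up for the example, not estimated from data.
candidate_senses = {
    "create": 0.15,   # "I made her duck" = I created the duck she owns
    "cook":   0.60,   # = I cooked waterfowl for her
    "cause":  0.25,   # = I caused her to quickly lower her head
}

best_sense = max(candidate_senses, key=candidate_senses.get)
print(best_sense)  # cook
```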
Finally, vector-space models, based on linear algebra, underlie information retrieval and many treatments of word meanings.
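As a bare-bones illustration of the vector-space idea, the sketch below represents a query and two documents as term-count vectors (the counts are invented for the example) and compares them with cosine similarity:

```python
import math

def cosine(u, v):
    # Cosine of the angle between two count vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# term order: [silk, export, china, duck]
query = [1, 1, 1, 0]
doc_a = [3, 2, 4, 0]   # a document about the silk trade
doc_b = [0, 0, 0, 5]   # a document about waterfowl

print(cosine(query, doc_a))  # close to 1.0
print(cosine(query, doc_b))  # 0.0
```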
Processing language using any of these models typically involves a search through
a space of states representing hypotheses about an input. In speech recognition, we search through a space of phone sequences for the correct word. In parsing, we search through a space of trees for the syntactic parse of an input sentence. In machine translation, we search through a space of translation hypotheses for the correct translation of a sentence into another language. For non-probabilistic tasks, such as state machines, we use well-known graph algorithms such as depth-first search. For probabilistic tasks, we use heuristic variants such as best-first and A* search, and rely on dynamic programming algorithms for computational tractability.
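One concrete instance of the kind of dynamic programming referred to here is minimum edit distance, which the book develops in a later chapter. A compact sketch with unit insertion, deletion, and substitution costs:

```python
def min_edit_distance(source: str, target: str) -> int:
    # Fill in a table of subproblem solutions instead of searching all alignments.
    n, m = len(source), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,              # deletion
                             dist[i][j - 1] + 1,              # insertion
                             dist[i - 1][j - 1] + substitution)
    return dist[n][m]

print(min_edit_distance("intention", "execution"))  # 5 with unit costs
```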
For many language tasks, we rely on machine learning tools like classifiers and sequence models. Classifiers like decision trees, support vector machines, Gaussian mixture models, and logistic regression are very commonly used. A hidden Markov model is one kind of sequence model; others are Maximum Entropy Markov Models and Conditional Random Fields.
Another tool that is related to machine learning is methodological: the use of distinct training and test sets, statistical techniques like cross-validation, and careful evaluation of our trained systems.
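The sketch below illustrates this methodological point with scikit-learn, chosen here only as a convenient example toolkit; the data are synthetic, generated on the spot rather than drawn from any corpus.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# Synthetic feature vectors and labels, invented for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out a test set; train only on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

classifier = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", classifier.score(X_test, y_test))
print("5-fold cross-validation:", cross_val_score(classifier, X_train, y_train, cv=5))
```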
1.4 Language, Thought, and Understanding

To many, the ability of computers to process language as skillfully as we humans do will signal the arrival of truly intelligent machines. The basis of this belief is the fact that the effective use of language is intertwined with our general cognitive abilities. Among the first to consider the computational implications of this intimate connection was Alan Turing (1950). In this famous paper, Turing introduced what has come to be known as the Turing test. Turing began with the thesis that the question of what it would mean for a machine to think was essentially unanswerable due to the inherent imprecision in the terms machine and think. Instead, he suggested an empirical test, a game, in which a computer's use of language would form the basis for determining if it could think. If the machine could win the game it would be judged intelligent.
In Turing's game, there are three participants: two people and a computer. One of the people is a contestant and plays the role of an interrogator. To win, the interrogator must determine which of the other two participants is the machine by asking a series of questions via a teletype. The task of the machine is to fool the interrogator into believing it is a person by responding as a person would to the interrogator's questions. The task of the second human participant is to convince the interrogator that the other participant is the machine, and that they are human.
The following interaction from Turing's paper illustrates the kind of interactions he had in mind. Clearly a convincing simulation of humanity doesn't require being an expert in all domains:

Q: Please write me a sonnet on the topic of the Forth Bridge.
A: Count me out on this one. I never could write poetry.
Q: Add 34957 to 70764.
A: (Pause about 30 seconds and then give answer as) 105621.
Turing predicted that roughly by the end of the twentieth century a machine with 10
gigabytes of memory would have around a 30% chance of fooling a human interrogator after 5 minutes of questioning. Given the fact that you can fool some of the people all the time, it is not clear how rigorous this particular standard is. Nevertheless, the critical issue for Turing was that using language as humans do is sufficient, by itself, as an operational test for intelligence.
Surprisingly, many of the issues raised by Turing became relevant as early as 1966 in the context of the ELIZA program (Weizenbaum, 1966). ELIZA was an early natural language processing system capable of carrying on a limited form of conversation with a user. Consider the following session with a version of ELIZA that imitated the responses of a Rogerian psychotherapist:

User1: You are like my father in some ways.
ELIZA1: WHAT RESEMBLANCE DO YOU SEE
User2: You are not very aggressive but I think you don't want me to notice that.
ELIZA2: WHAT MAKES YOU THINK I AM NOT AGGRESSIVE
User3: You don't argue with me.
ELIZA3: WHY DO YOU THINK I DON'T ARGUE WITH YOU
User4: You are afraid of me.
ELIZA4: DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU
As we will discuss in more detail in Ch. 2, ELIZA is a remarkably simple program that makes use of pattern matching to process the input and translate it into suitable outputs. The success of this simple technique in this domain is due to the fact that ELIZA doesn't actually need to know anything to mimic a Rogerian psychotherapist. As Weizenbaum notes, this is one of the few dialogue genres where the listener can act as if they know nothing of the world.
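A toy version of that mechanism fits in a few lines. The rules below are invented for illustration and are far simpler than Weizenbaum's actual rule set, but they show how a ranked list of pattern-to-template rules can produce ELIZA-like responses.

```python
# A toy ELIZA-style responder; the real ELIZA (Weizenbaum, 1966) used a much
# larger set of ranked patterns and transformations.
import re

RULES = [
    (re.compile(r".*\bYou are (.*)", re.IGNORECASE),
     "WHAT MAKES YOU THINK I AM {0}"),
    (re.compile(r".*\bI am (.*)", re.IGNORECASE),
     "WHY DO YOU SAY YOU ARE {0}"),
]

def respond(utterance: str) -> str:
    for pattern, template in RULES:
        match = pattern.match(utterance)
        if match:
            return template.format(match.group(1).rstrip(".!").upper())
    return "PLEASE GO ON"   # default when no pattern matches

print(respond("You are not very aggressive."))
# WHAT MAKES YOU THINK I AM NOT VERY AGGRESSIVE
```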
ELIZA's deep relevance to Turing's ideas is that many people who interacted with ELIZA came to believe that it really understood them and their problems. Indeed, Weizenbaum (1976) notes that many of these people continued to believe in ELIZA's abilities even after the program's operation was explained to them. In more recent years, Weizenbaum's informal reports have been repeated in a somewhat more controlled setting. Since 1991, an event known as the Loebner Prize competition has attempted to put various computer programs to the Turing test. Although these contests seem to have little scientific interest, a consistent result over the years has been that even the crudest programs can fool some of the judges some of the time (Shieber, 1994). Not surprisingly, these results have done nothing to quell the ongoing debate over the suitability of the Turing test as a test for intelligence among philosophers and AI researchers (Searle, 1980).
Fortunately, for the purposes of this book, the relevance of these results does not hinge on whether or not computers will ever be intelligent, or understand natural language. Far more important is recent related research in the social sciences that has confirmed another of Turing's predictions from the same paper:

Nevertheless I believe that at the end of the century the use of words and educated opinion will have altered so much that we will be able to speak of machines thinking without expecting to be contradicted.
It is now clear that regardless of what people believe or know about the inner workings of computers, they talk about them and interact with them as social entities. People act
toward computers as if they were people; they are polite to them, treat them as team members, and expect among other things that computers should be able to understand their needs and be capable of interacting with them naturally. For example, Reeves and Nass (1996) found that when a computer asked a human to evaluate how well the computer had been doing, the human gives more positive responses than when a different computer asks the same questions. People seemed to be afraid of being impolite. In a different experiment, Reeves and Nass found that people also give computers higher performance ratings if the computer has recently said something flattering to the human. Given these predispositions, speech- and language-based systems may provide many users with the most natural interface for many applications. This fact has led to a long-term focus in the field on the design of conversational agents, artificial entities that communicate conversationally.
1.5 The State of the Art

We can only see a short distance ahead, but we can see plenty there that needs to be done.
Alan Turing

This is an exciting time for the field of speech and language processing. The startling increase in computing resources available to the average computer user, the rise of the Web as a massive source of information, and the increasing availability of wireless mobile access have all placed speech and language processing applications in the technology spotlight. The following are examples of some currently deployed systems that reflect this trend:
• Travelers calling Amtrak, United Airlines, and other travel providers interact with conversational agents that guide them through the process of making reservations and getting arrival and departure information.
• Luxury car makers such as Mercedes-Benz provide automatic speech recognition and text-to-speech systems that allow drivers to control their environmental, entertainment, and navigational systems by voice. A similar spoken dialogue system has been deployed by astronauts on the International Space Station.
• Blinkx and other video search companies provide search services for millions of hours of video on the Web by using speech recognition technology to capture the words in the sound track.
• Google provides cross-language information retrieval and translation services whereby a user can supply queries in their native language to search collections in another language. Google translates the query, finds the most relevant pages, and then automatically translates them back to the user's native language.
• Large educational publishers such as Pearson, as well as testing services like ETS, use automated systems to analyze thousands of student essays, grading and assessing them in a manner that is indistinguishable from human graders.
• Interactive tutors, based on lifelike animated characters, serve as tutors for children learning to read, and as therapists for people dealing with aphasia and Parkinson's disease (?, ?).
• Text analysis companies such as Nielsen Buzzmetrics, Umbria, and Collective Intellect provide marketing intelligence based on automated measurements of user opinions, preferences, and attitudes as expressed in weblogs, discussion forums, and user groups.
1.6 Some Brief History

Historically, speech and language processing has been treated very differently in computer science, electrical engineering, linguistics, and psychology/cognitive science. Because of this diversity, speech and language processing encompasses a number of different but overlapping fields in these different departments: computational linguistics in linguistics, natural language processing in computer science, speech recognition in electrical engineering, and computational psycholinguistics in psychology. This section summarizes the different historical threads which have given rise to the field of speech and language processing. This section will provide only a sketch; see the individual chapters for more detail on each area and its terminology.
1.6.1 Foundational Insights: 1940s and 1950s
The earliest roots of the field date to the intellectually fertile period just after World War II that gave rise to the computer itself. This period from the 1940s through the end of the 1950s saw intense work on two foundational paradigms: the automaton and probabilistic or information-theoretic models.
The automaton arose in the 1950s out of Turing's (1936) model of algorithmic computation, considered by many to be the foundation of modern computer science. Turing's work led first to the McCulloch-Pitts neuron (McCulloch and Pitts, 1943), a simplified model of the neuron as a kind of computing element that could be described in terms of propositional logic, and then to the work of Kleene (1951) and (1956) on finite automata and regular expressions. Shannon (1948) applied probabilistic models of discrete Markov processes to automata for language. Drawing the idea of a finite-state Markov process from Shannon's work, Chomsky (1956) first considered finite-state machines as a way to characterize a grammar, and defined a finite-state language as a language generated by a finite-state grammar. These early models led to the field of formal language theory, which used algebra and set theory to define formal languages as sequences of symbols. This includes the context-free grammar, first defined by Chomsky (1956) for natural languages but independently discovered by Backus (1959) and Naur et al. (1960) in their descriptions of the ALGOL programming language.

The second foundational insight of this period was the development of probabilistic algorithms for speech and language processing, which dates to Shannon's other contribution: the metaphor of the noisy channel and decoding for the transmission of language through media like communication channels and speech acoustics. Shannon
also borrowed the concept of entropy from thermodynamics as a way of measuring
the information capacity of a channel, or the information content of a language, and performed the first measure of the entropy of English using probabilistic techniques.
It was also during this early period that the sound spectrograph was developed (Koenig et al., 1946), and foundational research was done in instrumental phonetics that laid the groundwork for later work in speech recognition. This led to the first machine speech recognizers in the early 1950s. In 1952, researchers at Bell Labs built a statistical system that could recognize any of the 10 digits from a single speaker (Davis et al., 1952). The system had 10 speaker-dependent stored patterns roughly representing the first two vowel formants in the digits. They achieved 97–99% accuracy by choosing the pattern which had the highest relative correlation coefficient with the input.
1.6.2 The Two Camps: 1957–1970
By the end of the 1950s and the early 1960s, speech and language processing had split very cleanly into two paradigms: symbolic and stochastic.
The symbolic paradigm took off from two lines of research. The first was the work of Chomsky and others on formal language theory and generative syntax throughout the late 1950s and early to mid 1960s, and the work of many linguists and computer scientists on parsing algorithms, initially top-down and bottom-up and then via dynamic programming. One of the earliest complete parsing systems was Zelig Harris's Transformations and Discourse Analysis Project (TDAP), which was implemented between June 1958 and July 1959 at the University of Pennsylvania (Harris, 1962).2 The second line of research was the new field of artificial intelligence. In the summer of 1956 John McCarthy, Marvin Minsky, Claude Shannon, and Nathaniel Rochester brought together a group of researchers for a two-month workshop on what they decided to call artificial intelligence (AI). Although AI always included a minority of researchers focusing on stochastic and statistical algorithms (including probabilistic models and neural nets), the major focus of the new field was the work on reasoning and logic typified by Newell and Simon's work on the Logic Theorist and the General Problem Solver. At this point early natural language understanding systems were built. These were simple systems that worked in single domains mainly by a combination of pattern matching and keyword search with simple heuristics for reasoning and question-answering. By the late 1960s more formal logical systems were developed.
The stochastic paradigm took hold mainly in departments of statistics and of electrical engineering. By the late 1950s the Bayesian method was beginning to be applied to the problem of optical character recognition. Bledsoe and Browning (1959) built a Bayesian system for text recognition that used a large dictionary and computed the likelihood of each observed letter sequence given each word in the dictionary by multiplying the likelihoods for each letter. Mosteller and Wallace (1964) applied Bayesian methods to the problem of authorship attribution on The Federalist papers.
The 1960s also saw the rise of the first serious testable psychological models of
2 This system was reimplemented recently and is described by Joshi and Hopely (1999) and Karttunen (1999), who note that the parser was essentially implemented as a cascade of finite-state transducers.
human language processing based on transformational grammar, as well as the first on-line corpora: the Brown corpus of American English, a one-million-word collection of samples from 500 written texts from different genres (newspaper, novels, non-fiction, academic, etc.), which was assembled at Brown University in 1963–64 (Kučera and Francis, 1967; Francis, 1979; Francis and Kučera, 1982), and William S. Y. Wang's 1967 DOC (Dictionary on Computer), an on-line Chinese dialect dictionary.
1.6.3 Four Paradigms: 1970–1983
The next period saw an explosion in research in speech and language processing and the development of a number of research paradigms that still dominate the field.
The stochastic paradigm played a huge role in the development of speech recognition algorithms in this period, particularly the use of the hidden Markov model and the metaphors of the noisy channel and decoding, developed independently by Jelinek, Bahl, Mercer, and colleagues at IBM's Thomas J. Watson Research Center, and by Baker at Carnegie Mellon University, who was influenced by the work of Baum and colleagues at the Institute for Defense Analyses in Princeton. AT&T's Bell Laboratories was also a center for work on speech recognition and synthesis; see Rabiner and Juang (1993) for descriptions of the wide range of this work.
The logic-based paradigm was begun by the work of Colmerauer and his colleagues on Q-systems and metamorphosis grammars (Colmerauer, 1970, 1975), the forerunners of Prolog, and Definite Clause Grammars (Pereira and Warren, 1980). Independently, Kay's (1979) work on functional grammar and, shortly later, Bresnan and Kaplan's (1982) work on LFG established the importance of feature structure unification.
The natural language understanding field took off during this period, beginning with Terry Winograd's SHRDLU system, which simulated a robot embedded in a world of toy blocks (Winograd, 1972). The program was able to accept natural language text commands (Move the red block on top of the smaller green one) of a hitherto unseen complexity and sophistication. His system was also the first to attempt to build an extensive (for the time) grammar of English, based on Halliday's systemic grammar. Winograd's model made it clear that the problem of parsing was well enough understood to begin to focus on semantics and discourse models. Roger Schank and his colleagues and students (in what was often referred to as the Yale School) built a series of language understanding programs that focused on human conceptual knowledge such as scripts, plans, and goals, and human memory organization (Schank and Abelson, 1977; Schank and Riesbeck, 1981; Cullingford, 1981; Wilensky, 1983; Lehnert, 1977). This work often used network-based semantics (Quillian, 1968; Norman and Rumelhart, 1975; Schank, 1972; Wilks, 1975b, 1975a; Kintsch, 1974) and began to incorporate Fillmore's notion of case roles (Fillmore, 1968) into their representations (Simmons, 1973).
The logic-based and natural-language understanding paradigms were unified in systems that used predicate logic as a semantic representation, such as the LUNAR question-answering system (Woods, 1967, 1973).
The discourse modeling paradigm focused on four key areas in discourse. Grosz and her colleagues introduced the study of substructure in discourse and of discourse
focus (Grosz, 1977; Sidner, 1983), a number of researchers began to work on automatic reference resolution (Hobbs, 1978), and the BDI (Belief-Desire-Intention) framework for logic-based work on speech acts was developed (Perrault and Allen, 1980; Cohen and Perrault, 1979).
1.6.4 Empiricism and Finite State Models Redux: 1983–1993
This next decade saw the return of two classes of models which had lost popularity in the late 1950s and early 1960s, partially due to theoretical arguments against them such as Chomsky's influential review of Skinner's Verbal Behavior (Chomsky, 1959). The first class was finite-state models, which began to receive attention again after work on finite-state phonology and morphology by Kaplan and Kay (1981) and finite-state models of syntax by Church (1980). A large body of work on finite-state models will be described throughout the book.
The second trend in this period was what has been called the "return of empiricism"; most notable here was the rise of probabilistic models throughout speech and language processing, influenced strongly by the work at the IBM Thomas J. Watson Research Center on probabilistic models of speech recognition. These probabilistic methods and other such data-driven approaches spread from speech into part-of-speech tagging, parsing and attachment ambiguities, and semantics. This empirical direction was also accompanied by a new focus on model evaluation, based on using held-out data, developing quantitative metrics for evaluation, and emphasizing the comparison of performance on these metrics with previously published research.

This period also saw considerable work on natural language generation.
1.6.5 The Field Comes Together: 1994–1999
By the last five years of the millennium it was clear that the field was vastly changing. First, probabilistic and data-driven models had become quite standard throughout natural language processing. Algorithms for parsing, part-of-speech tagging, reference resolution, and discourse processing all began to incorporate probabilities and employ evaluation methodologies borrowed from speech recognition and information retrieval. Second, the increases in the speed and memory of computers had allowed commercial exploitation of a number of subareas of speech and language processing, in particular speech recognition and spelling and grammar checking. Speech and language processing algorithms began to be applied to Augmentative and Alternative Communication (AAC). Finally, the rise of the Web emphasized the need for language-based information retrieval and information extraction.
1.6.6 The Rise of Machine Learning: 2000–2007
The empiricist trends begun in the latter part of the 1990s accelerated at an astounding pace in the new century. This acceleration was largely driven by three synergistic trends. First, large amounts of spoken and written material became widely available through the auspices of the Linguistic Data Consortium (LDC) and other similar organizations. Importantly, included among these materials were annotated collections such as the Penn Treebank (Marcus et al., 1993), Prague Dependency Treebank (Hajič, 1998), PropBank (Palmer et al., 2005), Penn Discourse Treebank (Miltsakaki et al., 2004), RSTBank (Carlson et al., 2001), and TimeBank (?), all of which layered standard text sources with various forms of syntactic, semantic, and pragmatic annotations. The existence of these resources promoted the trend of casting more complex traditional problems, such as parsing and semantic analysis, as problems in supervised machine learning. These resources also promoted the establishment of additional competitive evaluations for parsing (Dejean and Tjong Kim Sang, 2001), information extraction (?, ?), word sense disambiguation (Palmer et al., 2001; Kilgarriff and Palmer, 2000), and question answering (Voorhees and Tice, 1999).

Second, this increased focus on learning led to a more serious interplay with the statistical machine learning community. Techniques such as support vector machines (?; Vapnik, 1995), multinomial logistic regression (MaxEnt) (Berger et al., 1996), and graphical Bayesian models (Pearl, 1988) became standard practice in computational linguistics. Third, the widespread availability of high-performance computing systems facilitated the training and deployment of systems that could not have been imagined a decade earlier.

Finally, near the end of this period, largely unsupervised statistical approaches began to receive renewed attention. Progress on statistical approaches to machine translation (Brown et al., 1990; Och and Ney, 2003) and topic modeling (?) demonstrated that effective applications could be constructed from systems trained on unannotated data alone. In addition, the widespread cost and difficulty of producing reliably annotated corpora became a limiting factor in the use of supervised approaches for many problems. This trend towards the use of unsupervised techniques will likely increase.
1.6.7 On Multiple Discoveries
Even in this brief historical overview, we have mentioned a number of cases of multiple independent discoveries of the same idea. Just a few of the "multiples" to be discussed in this book include the application of dynamic programming to sequence comparison by Viterbi, Vintsyuk, Needleman and Wunsch, Sakoe and Chiba, Sankoff, Reichert et al., and Wagner and Fischer (Chapters 3, 5, and 6); the HMM/noisy channel model of speech recognition by Baker and by Jelinek, Bahl, and Mercer (Chapters 6, 9, and 10); the development of context-free grammars by Chomsky and by Backus and Naur (Chapter 12); the proof that Swiss-German has a non-context-free syntax by Huybregts and by Shieber (Chapter 15); and the application of unification to language processing by Colmerauer et al. and by Kay (Chapter 16).
Are these multiples to be considered astonishing coincidences? A well-known hypothesis by sociologist of science Robert K. Merton (1961) argues, quite the contrary, that

all scientific discoveries are in principle multiples, including those that on the surface appear to be singletons.

Of course there are many well-known cases of multiple discovery or invention; just a few examples from an extensive list in Ogburn and Thomas (1922) include the multiple
invention of the calculus by Leibnitz and by Newton, the multiple development of the theory of natural selection by Wallace and by Darwin, and the multiple invention of the telephone by Gray and Bell.3 But Merton gives a further array of evidence for the hypothesis that multiple discovery is the rule rather than the exception, including many cases of putative singletons that turn out to be a rediscovery of previously unpublished or perhaps inaccessible work. An even stronger piece of evidence is his ethnomethodological point that scientists themselves act under the assumption that multiple invention is the norm. Thus many aspects of scientific life are designed to help scientists avoid being "scooped": submission dates on journal articles, careful dates in research records, circulation of preliminary or technical reports.
1.6.8 A Final Brief Note on Psychology
Many of the chapters in this book include short summaries of psychological research on human processing. Of course, understanding human language processing is an important scientific goal in its own right and is part of the general field of cognitive science. However, an understanding of human language processing can often be helpful in building better machine models of language. This seems contrary to the popular wisdom, which holds that direct mimicry of nature's algorithms is rarely useful in engineering applications. For example, the argument is often made that if we copied nature exactly, airplanes would flap their wings; yet airplanes with fixed wings are a more successful engineering solution. But language is not aeronautics. Cribbing from nature is sometimes useful for aeronautics (after all, airplanes do have wings), but it is particularly useful when we are trying to solve human-centered tasks. Airplane flight has different goals than bird flight; but the goal of speech recognition systems, for example, is to perform exactly the task that human court reporters perform every day: transcribe spoken dialog. Since people already do this well, we can learn from nature's previous solution. Since an important application of speech and language processing systems is for human-computer interaction, it makes sense to copy a solution that behaves the way people are accustomed to.
1.7 Summary

This chapter introduces the field of speech and language processing. The following are some of the highlights of this chapter:
• A good way to understand the concerns of speech and language processing research is to consider what it would take to create an intelligent agent like HAL from 2001: A Space Odyssey, or build a Web-based question answerer, or a machine translation engine.
• Speech and language technology relies on formal models, or representations, of knowledge of language at the levels of phonology and phonetics, morphology, syntax, semantics, pragmatics, and discourse. A small number of formal models, including state machines, formal rule systems, logic, and probabilistic models, are used to capture this knowledge.
3 Ogburn and Thomas are generally credited with noticing that the prevalence of multiple inventions suggests that the cultural milieu and not individual genius is the deciding causal factor in scientific discovery. In an amusing bit of recursion, however, Merton notes that even this idea has been multiply discovered, citing sources from the 19th century and earlier!
• The foundations of speech and language technology lie in computer science,
linguistics, mathematics, electrical engineering, and psychology. A small number of algorithms from standard frameworks are used throughout speech and language processing.
• The critical connection between language and thought has placed speech and language processing technology at the center of debate over intelligent machines. Furthermore, research on how people interact with complex media indicates that speech and language processing technology will be critical in the development of future technologies.
• Revolutionary applications of speech and language processing are currently in use around the world. The creation of the Web, as well as significant recent improvements in speech recognition and synthesis, will lead to many more applications.
Bibliographical and Historical Notes

Research in the various subareas of speech and language processing is spread across a wide number of conference proceedings and journals. The conferences and journals most centrally concerned with natural language processing and computational linguistics are associated with the Association for Computational Linguistics (ACL), its European counterpart (EACL), and the International Conference on Computational Linguistics (COLING). The annual proceedings of ACL, NAACL, and EACL, and the biennial COLING conference are the primary forums for work in this area. Related conferences include various proceedings of ACL Special Interest Groups (SIGs) such as the Conference on Natural Language Learning (CoNLL), as well as the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Research on speech recognition, understanding, and synthesis is presented at the annual INTERSPEECH conference, which is called the International Conference on Spoken Language Processing (ICSLP) and the European Conference on Speech Communication and Technology (EUROSPEECH) in alternating years, or the annual IEEE International Conference on Acoustics, Speech, and Signal Processing (IEEE ICASSP). Spoken language dialogue research is presented at these or at workshops like SIGDial. Journals include Computational Linguistics, Natural Language Engineering, Speech Communication, Computer Speech and Language, the IEEE Transactions on Audio, Speech & Language Processing, and the ACM Transactions on Speech and Language Processing.
Work on language processing from an Artificial Intelligence perspective can be found in the annual meetings of the American Association for Artificial Intelligence (AAAI), as well as the biennial International Joint Conference on Artificial Intelligence (IJCAI) meetings. Artificial intelligence journals that periodically feature work on speech and language processing include Machine Learning, the Journal of Machine Learning Research, and the Journal of Artificial Intelligence Research.
There are a fair number of textbooks available covering various aspects of speech and language processing. Manning and Schütze (1999) (Foundations of Statistical Natural Language Processing) focuses on statistical models of tagging, parsing, disambiguation, collocations, and other areas. Charniak (1993) (Statistical Language Learning) is an accessible, though older and less extensive, introduction to similar material. Manning et al. (2008) focuses on information retrieval, text classification, and clustering. NLTK, the Natural Language Toolkit (Bird and Loper, 2004), is a suite of Python modules and data for natural language processing, together with a Natural Language Processing book based on the NLTK suite. Allen (1995) (Natural Language Understanding) provides extensive coverage of language processing from the AI perspective. Gazdar and Mellish (1989) (Natural Language Processing in Lisp/Prolog) covers especially automata, parsing, features, and unification and is available free online. Pereira and Shieber (1987) gives a Prolog-based introduction to parsing and interpretation. Russell and Norvig (2002) is an introduction to artificial intelligence that includes chapters on natural language processing. Partee et al. (1990) has a very broad coverage of mathematical linguistics. A historically significant collection of foundational papers can be found in Grosz et al. (1986) (Readings in Natural Language Processing).
Of course, a wide variety of speech and language processing resources are now available on the Web. Pointers to these resources are maintained on the home page for this book at:
http://www.cs.colorado.edu/~martin/slp.html
Allen, J (1995) Natural Language Understanding Benjamin
Cummings, Menlo Park, CA.
Backus, J. W. (1959). The syntax and semantics of the proposed international algebraic language of the Zurich ACM-GAMM Conference. In Information Processing: Proceedings of the International Conference on Information Processing, Paris, pp. 125–132. UNESCO.
Berger, A., Della Pietra, S A., and Della Pietra, V J (1996) A
maximum entropy approach to natural language processing.
Computational Linguistics, 22(1), 39–71.
Bird, S and Loper, E (2004) NLTK: The Natural Language
Toolkit In Proceedings of the ACL 2004 demonstration
ses-sion, Barcelona, Spain, pp 214–217.
Bledsoe, W W and Browning, I (1959) Pattern recognition
and reading by machine In 1959 Proceedings of the Eastern
Joint Computer Conference, pp 225–232 Academic, New
York.
Bresnan, J and Kaplan, R M (1982) Introduction: Grammars
as mental representations of language In Bresnan, J (Ed.),
The Mental Representation of Grammatical Relations MIT
Press, Cambridge, MA.
Brown, P F., Cocke, J., Della Pietra, S A., Della Pietra, V J.,
Jelinek, F., Lafferty, J D., Mercer, R L., and Roossin, P S.
(1990) A statistical approach to machine translation
Com-putational Linguistics, 16(2), 79–85.
Carlson, L., Marcu, D., and Okurowski, M E (2001)
Build-ing a discourse-tagged corpus in the framework of rhetorical
structure theory In Proceedings of SIGDIAL.
Charniak, E (1993) Statistical Language Learning MIT Press.
Chomsky, N. (1956). Three models for the description of language. IRE Transactions on Information Theory, 2(3), 113–124.
Chomsky, N. (1959). A review of B. F. Skinner's "Verbal Behavior". Language, 35, 26–58.
Church, K. W. (1980). On memory limitations in natural language processing. Master's thesis, MIT. Distributed by the Indiana University Linguistics Club.
Cohen, P R and Perrault, C R (1979) Elements of a
plan-based theory of speech acts Cognitive Science, 3(3), 177–
212.
Colmerauer, A (1970) Les syst`emes-q ou un formalisme pour
analyser et synth´etiser des phrase sur ordinateur Internal
pub-lication 43, D´epartement d’informatique de l’Universit´e de
Montr´eal†.
Colmerauer, A (1975) Les grammaires de m´etamorphose GIA.
Internal publication, Groupe Intelligence artificielle, Facult´e
des Sciences de Luminy, Universit´e Aix-Marseille II, France,
Nov 1975 English version, Metamorphosis grammars In L.
Bolc, (Ed.), Natural Language Communication with
Comput-ers, Lecture Notes in Computer Science 63, Springer Verlag,
Berlin, 1978, pp 133–189.
Cullingford, R E (1981) SAM In Schank, R C and Riesbeck,
C K (Eds.), Inside Computer Understanding: Five Programs
plus Miniatures, pp 75–119 Lawrence Erlbaum, Hillsdale,
NJ.
Davis, K H., Biddulph, R., and Balashek, S (1952) Automatic
recognition of spoken digits Journal of the Acoustical Society
of America, 24(6), 637–642.
Dejean, H. and Tjong Kim Sang, E. F. (2001). Introduction to the CoNLL-2001 shared task: Clause identification. In Proceedings of CoNLL-2001.
Fillmore, C J (1968) The case for case In Bach, E W and
Harms, R T (Eds.), Universals in Linguistic Theory, pp 1–
88 Holt, Rinehart & Winston, New York.
Francis, W N (1979) A tagged corpus – problems and prospects In Greenbaum, S., Leech, G., and Svartvik, J.
(Eds.), Studies in English linguistics for Randolph Quirk, pp.
192–209 Longman, London and New York.
Francis, W N and Kuˇcera, H (1982) Frequency Analysis of
English Usage Houghton Mifflin, Boston.
Gazdar, G. and Mellish, C. (1989). Natural Language Processing in LISP. Addison Wesley.
Grosz, B J (1977) The representation and use of focus in a
system for understanding dialogs In IJCAI-77, Cambridge,
MA, pp 67–76 Morgan Kaufmann Reprinted in Grosz et al (1986).
Grosz, B J., Jones, K S., and Webber, B L (Eds.) (1986).
Readings in Natural Language Processing Morgan
Kauf-mann, Los Altos, Calif.
Hajiˇc, J (1998) Building a Syntactically Annotated Corpus:
The Prague Dependency Treebank, pp 106–132 Karolinum,
Prague/Praha.
Harris, Z S (1962) String Analysis of Sentence Structure.
Mouton, The Hague.
Hobbs, J R (1978) Resolving pronoun references Lingua,
44, 311–338 Reprinted in Grosz et al (1986).
Joshi, A. K. and Hopely, P. (1999). A parser from antiquity. In Kornai, A. (Ed.), Extended Finite State Models of Language, pp. 6–15. Cambridge University Press, Cambridge.
Kaplan, R. M. and Kay, M. (1981). Phonological rules and finite-state transducers. Paper presented at the Annual Meeting of the Linguistics Society of America, New York.
Karttunen, L. (1999). Comments on Joshi. In Kornai, A. (Ed.), Extended Finite State Models of Language, pp. 16–18. Cambridge University Press, Cambridge.
Kay, M (1979) Functional grammar In BLS-79, Berkeley, CA,
pp 142–158.
Kilgarriff, A and Palmer, M (Eds.) (2000) Computing and the
Humanities: Special Issue on SENSEVAL, Vol 34 Kluwer.
Kintsch, W (1974) The Representation of Meaning in Memory.
Wiley, New York.
Kleene, S C (1951) Representation of events in nerve nets and finite automata Tech rep RM-704, RAND Corporation RAND Research Memorandum†.
Kleene, S C (1956) Representation of events in nerve nets and
finite automata In Shannon, C and McCarthy, J (Eds.),
Au-tomata Studies, pp 3–41 Princeton University Press,
Prince-ton, NJ.
Koenig, W., Dunn, H. K., and Lacy, L. Y. (1946). The sound spectrograph. Journal of the Acoustical Society of America, 18, 19–49.
Kučera, H. and Francis, W. N. (1967). Computational Analysis of Present-Day American English. Brown University Press, Providence, RI.
Lehnert, W. G. (1977). A conceptual theory of question answering. In IJCAI-77, Cambridge, MA, pp. 158–164. Morgan Kaufmann.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK.
Manning, C. D. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.
Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19(2), 313–330.
McCulloch, W. S. and Pitts, W. (1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133. Reprinted in Neurocomputing: Foundations of Research, ed. by J. A. Anderson and E. Rosenfeld. MIT Press, 1988.
Merton, R. K. (1961). Singletons and multiples in scientific discovery. American Philosophical Society Proceedings, 105(5), 470–486.
Miltsakaki, E., Prasad, R., Joshi, A K., and Webber, B L.
(2004) The Penn Discourse Treebank In LREC-04.
Mosteller, F and Wallace, D L (1964) Inference and Disputed
Authorship: The Federalist Springer-Verlag, New York 2nd
Edition appeared in 1984 and was called Applied Bayesian
and Classical Inference.
Naur, P., Backus, J. W., Bauer, F. L., Green, J., Katz, C., McCarthy, J., Perlis, A. J., Rutishauser, H., Samelson, K., Vauquois, B., Wegstein, J. H., van Wijngaarden, A., and Woodger, M. (1960). Report on the algorithmic language ALGOL 60. Communications of the ACM, 3(5), 299–314. Revised in CACM 6:1, 1–17, 1963.
Norman, D A and Rumelhart, D E (1975) Explorations in
Cognition Freeman, San Francisco, CA.
Och, F. J. and Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.
Ogburn, W. F. and Thomas, D. S. (1922). Are inventions inevitable? A note on social evolution. Political Science Quarterly, 37, 83–98.
Palmer, M., Fellbaum, C., Cotton, S., Delfs, L., and Dang, H. T. (2001). English tasks: All-words and verb lexical sample. In Proceedings of SENSEVAL-2: Second International Workshop on Evaluating Word Sense Disambiguation Systems, Toulouse, France.
Palmer, M., Kingsbury, P., and Gildea, D. (2005). The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1), 71–106.
Partee, B. H., ter Meulen, A., and Wall, R. E. (1990). Mathematical Methods in Linguistics. Kluwer, Dordrecht.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.
Pereira, F C N and Shieber, S M (1987) Prolog and
Natural-Language Analysis, Vol 10 of CSLI Lecture Notes Chicago
University Press, Chicago.
Pereira, F. C. N. and Warren, D. H. D. (1980). Definite clause grammars for language analysis — a survey of the formalism and a comparison with augmented transition networks. Artificial Intelligence, 13(3), 231–278.
Perrault, C. R. and Allen, J. (1980). A plan-based analysis of indirect speech acts. American Journal of Computational Linguistics, 6(3–4), 167–182.
Quillian, M R (1968) Semantic memory In Minsky, M (Ed.),
Semantic Information Processing, pp 227–270 MIT Press,
Cambridge, MA.
Rabiner, L R and Juang, B (1993) Fundamentals of Speech
Recognition Prentice Hall, Englewood Cliffs, NJ.
Reeves, B and Nass, C (1996) The Media Equation: How
People Treat Computers, Television, and New Media Like Real People and Places Cambridge University Press, Cambridge.
Russell, S. and Norvig, P. (2002). Artificial Intelligence: A Modern Approach. Prentice Hall, Englewood Cliffs, NJ. Second edition.
Schank, R. C. (1972). Conceptual dependency: A theory of natural language processing. Cognitive Psychology, 3, 552–631.
Schank, R. C. and Abelson, R. P. (1977). Scripts, Plans, Goals and Understanding. Lawrence Erlbaum, Hillsdale, NJ.
Schank, R. C. and Riesbeck, C. K. (Eds.). (1981). Inside Computer Understanding: Five Programs plus Miniatures. Lawrence Erlbaum, Hillsdale, NJ.
Searle, J R (1980) Minds, brains, and programs Behavioral
and Brain Sciences, 3, 417–457.
Shannon, C E (1948) A mathematical theory of
communica-tion Bell System Technical Journal, 27(3), 379–423
Contin-ued in following volume.
Shieber, S M (1994) Lessons from a restricted Turing test.
Communications of the ACM, 37(6), 70–78.
Sidner, C L (1983) Focusing in the comprehension of definite
anaphora In Brady, M and Berwick, R C (Eds.),
Compu-tational Models of Discourse, pp 267–330 MIT Press,
Cam-bridge, MA.
Simmons, R F (1973) Semantic networks: Their tion and use for understanding English sentences In Schank,
computa-R C and Colby, K M (Eds.), Computer Models of Thought
and Language, pp 61–113 W.H Freeman and Co., San
Fran-cisco.
Turing, A. M. (1936). On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 42, 230–265. Read to the Society in 1936, but published in 1937. Correction in volume 43, 544–546.
Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59, 433–460.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York.
Voorhees, E. M. and Tice, D. M. (1999). The TREC-8 question answering track evaluation. In Proceedings of the TREC-8 Workshop.
Weizenbaum, J. (1966). ELIZA – A computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1), 36–45.
Weizenbaum, J. (1976). Computer Power and Human Reason: From Judgement to Calculation. W. H. Freeman and Company, San Francisco.
Wilensky, R. (1983). Planning and Understanding. Addison-Wesley, Reading, MA.
Wilks, Y. (1975a). Preference semantics. In Keenan, E. L. (Ed.), The Formal Semantics of Natural Language, pp. 329–350. Cambridge University Press, Cambridge.
Wilks, Y. (1975b). A preferential, pattern-seeking, semantics for natural language inference. Artificial Intelligence, 6(1), 53–74.
Winograd, T. (1972). Understanding natural language. Cognitive Psychology, 3(1), 1–191. Reprinted as a book by Academic Press, 1972.
Woods, W. A. (1967). Semantics for a Question-Answering System. Ph.D. thesis, Harvard University.
Woods, W. A. (1973). Progress in natural language understanding. In Proceedings of AFIPS National Conference, pp. 441–450.
2 Regular Expressions and Automata
User1: Men are all alike.
ELIZA1: IN WHAT WAY
User2: They're always bugging us about something or other.
ELIZA2: CAN YOU THINK OF A SPECIFIC EXAMPLE
User3: Well, my boyfriend made me come here.
ELIZA3: YOUR BOYFRIEND MADE YOU COME HERE
User4: He says I'm depressed much of the time.
ELIZA4: I AM SORRY TO HEAR YOU ARE DEPRESSED.
Suppose a Web search for woodchucks turns up pages like "interesting links to woodchucks and lemurs" and "all about Vermont's unique, endangered species", some using the singular woodchuck and some the plural woodchucks. Instead of having to do this search twice, you would have rather typed one search command specifying something like woodchuck with an optional final s. Or perhaps you might want to search for all the prices in some document; you might want to see all strings that look like $199 or $25 or $24.99. In this chapter we introduce the regular expression, the standard notation for characterizing text sequences. The regular expression is used for specifying text strings in situations like this Web-search example and in other information retrieval applications, but it also plays an important role in word-processing, computation of frequencies from corpora, and other such tasks.
After we have defined regular expressions, we show how they can be implemented via the finite-state automaton. The finite-state automaton is not only the mathematical device used to implement regular expressions, but also one of the most significant tools of computational linguistics. Variations of automata such as finite-state transducers, Hidden Markov Models, and N-gram grammars are important components of applications that we will introduce in later chapters, including speech recognition and synthesis, machine translation, spell-checking, and information extraction.
SIR ANDREW: Her C’s, her U’s and her T’s: why that?
Shakespeare, Twelfth Night
One of the unsung successes in standardization in computer science has been the regular expression (RE), a language for specifying text search strings.
A regular expression (first developed by Kleene (1956), but see the History section for more details) is a formula in a special language that is used for specifying simple classes of strings. A string is a sequence of symbols; for the purpose of most
STRINGS
text-based search techniques, a string is any sequence of alphanumeric characters (letters, numbers, spaces, tabs, and punctuation). For these purposes a space is just a character like any other, and we represent it with the symbol ␣.
Formally, a regular expression is an algebraic notation for characterizing a set of strings. Thus they can be used to specify search strings as well as to define a language in a formal way. We will begin by talking about regular expressions as a way of specifying searches in texts, and proceed to other uses. Section 2.3 shows that the use of just three regular expression operators is sufficient to characterize strings, but we use the more convenient and commonly-used regular expression syntax of the Perl language throughout this section. Since common text-processing programs agree on most of the syntax of regular expressions, most of what we say extends to all UNIX, Microsoft Word, and WordPerfect regular expressions. Appendix A shows the few areas where these programs differ from the Perl syntax.
Regular expression search requires a pattern that we want to search for and a corpus of texts to search through. A regular expression search function will search through the corpus and return all texts that match the pattern; in this chapter we will assume that the search engine returns the matching line of the document. This is what the UNIX grep command does. We will underline the exact part of the pattern that matches the regular expression. A search can be designed to return all matches to a regular expression or only the first match. We will show only the first match.
2.1.1 Basic Regular Expression Patterns
The simplest kind of regular expression is a sequence of simple characters. For example, to search for woodchuck, we type /woodchuck/. So the regular expression /Buttercup/ matches any string containing the substring Buttercup, for example the line I'm called little Buttercup (recall that we are assuming a search application that returns entire lines). From here on we will put slashes around each regular expression to make it clear what is a regular expression and what is a pattern. We use the slash since this is the notation used by Perl, but the slashes are not part of the regular expressions.
The search string can consist of a single character (like /!/) or a sequence of characters (like /urgl/). The first instance of each match to the regular expression is underlined below (although a given application might choose to return more than just the first instance):
/woodchucks/ “interesting links to woodchucks and lemurs”
/Claire says,/ “Dagmar, my gift please,” Claire says,”
/!/ “You’ve left the burglar behind again!” said Nori
Regular expressions are case sensitive; lowercase /s/ is distinct from uppercase /S/ (/s/ matches a lowercase s but not an uppercase S). This means that the pattern /woodchucks/ will not match the string Woodchucks. We can solve this problem with the use of the square braces [ and ]. The string of characters inside the braces specifies a disjunction of characters to match. For example, Fig. 2.1 shows that the pattern /[wW]/ matches patterns containing either w or W.
/[wW]oodchuck/   Woodchuck or woodchuck   "Woodchuck"
Figure 2.1   The use of the brackets [ ] to specify a disjunction of characters.
The regular expression /[1234567890]/ specifies any single digit. While classes of characters like digits or letters are important building blocks in expressions, they can get awkward (e.g., it's inconvenient to specify
/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/
to mean "any capital letter"). In these cases the brackets can be used with the dash (-) to specify any one character in a range. The pattern /[2-5]/ specifies any one of the
RANGE
characters 2, 3, 4, or 5. The pattern /[b-g]/ specifies one of the characters b, c, d, e, f, or g. Some other examples:
/[A-Z]/   an uppercase letter   "we should call it 'Drenched Blossoms'"
/[a-z]/   a lowercase letter    "my beans were impatient to be hoed!"
/[0-9]/   a single digit        "Chapter 1: Down the Rabbit Hole"
Figure 2.2   The use of the brackets [ ] plus the dash - to specify a range.
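Most modern regular expression libraries accept exactly this bracket-and-range notation. As a quick check, here is a minimal Python sketch using the standard re module (an illustration added here, not part of the original text; the test strings are taken from the figures above):

```python
import re

# A disjunction of characters inside square brackets: [wW] matches w or W.
print(re.search(r"[wW]oodchuck", "Woodchuck").group())                       # Woodchuck

# Ranges with the dash: [A-Z] is any uppercase letter, [0-9] any digit.
print(re.search(r"[A-Z]", "we should call it 'Drenched Blossoms'").group())  # D
print(re.findall(r"[0-9]", "Chapter 1: Down the Rabbit Hole"))               # ['1']
```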
The square braces can also be used to specify what a single character cannot be, by use of the caret ˆ. If the caret ˆ is the first symbol after the open square brace [, the resulting pattern is negated. For example, the pattern /[ˆa]/ matches any single character (including special characters) except a. This is only true when the caret is the first symbol after the open square brace. If it occurs anywhere else, it usually stands for a caret; Fig. 2.3 shows some examples.
RE        Match (single characters)    Example Patterns Matched
[ˆA-Z]    not an uppercase letter      "Oyfn pripetchik"
[ˆSs]     neither 'S' nor 's'          "I have no exquisite reason for't"
Figure 2.3   Uses of the caret ˆ for negation or just to mean ˆ.
The use of square braces solves our capitalization problem for woodchucks. But we still haven't answered our original question: how do we specify both woodchuck and woodchucks? We can't use the square brackets, because while they allow us to say "s or S", they don't allow us to say "s or nothing". For this we use the question mark /?/, which means "the preceding character or nothing", as shown in Fig. 2.4.
woodchucks?   woodchuck or woodchucks   "woodchuck"
Figure 2.4   The question mark ? marks optionality of the previous expression.
We can think of the question mark as meaning "zero or one instances of the previous character". That is, it's a way of specifying how many of something we want.
So far we haven't needed to specify that we want more than one of something. But sometimes we need regular expressions that allow repetitions of things. For example, consider the language of (certain) sheep, which consists of strings that look like the following:

baa!
baaa!
baaaa!
baaaaa!
...

This language consists of strings with a b, followed by at least two as, followed by an exclamation point. The set of operators that allows us to say things like "some number of as" is based on the asterisk or *, commonly called the Kleene *.
KLEENE *
The Kleene star means "zero or more occurrences of the immediately previous character or regular expression". So /a*/ matches zero or more as, and /aa*/ matches one or more as: one a followed by zero or more as. More complex patterns can also be repeated. So /[ab]*/ means "zero or more as or bs" (not "zero or more right square braces"). This will match strings like aaaa or ababab or bbbb.
We now know enough to specify part of our regular expression for prices: multiple digits. Recall that the regular expression for an individual digit was /[0-9]/. So the regular expression for an integer (a string of digits) is /[0-9][0-9]*/. (Why isn't it just /[0-9]*/?)
Sometimes it's annoying to have to write the regular expression for digits twice, so there is a shorter way to specify "at least one" of some character. This is the Kleene +,
KLEENE +
which means "one or more of the previous character". Thus the expression /[0-9]+/ is the normal way to specify "a sequence of digits". There are thus two ways to specify the sheep language: /baaa*!/ or /baa+!/.
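To make the two equivalent sheep-language patterns concrete, here is a small Python sketch (an illustration added here, not part of the original text); Python's re module uses the same *, +, and ? operators as the Perl notation in this chapter:

```python
import re

sheep = ["baa!", "baaa!", "ba!", "baaaa"]

for s in sheep:
    # /baaa*!/ and /baa+!/ describe the same language:
    # a b, at least two a's, then an exclamation point.
    m1 = re.fullmatch(r"baaa*!", s)
    m2 = re.fullmatch(r"baa+!", s)
    print(s, bool(m1), bool(m2))
# baa!   True True
# baaa!  True True
# ba!    False False   (only one a)
# baaaa  False False   (no exclamation point)
```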
One very important special character is the period (/./), a wildcard expression
that matches any single character (except a carriage return):
/beg.n/   any character between beg and n   begin, beg'n, begun
Figure 2.5   The use of the period . to specify any character.
The wildcard is often used together with the Kleene star to mean "any string of characters". For example, suppose we want to find any line in which a particular word, for example aardvark, appears twice. We can specify this with the regular expression
/aardvark.*aardvark/
Anchors are special characters that anchor regular expressions to particular places
ANCHORS
in a string. The most common anchors are the caret ˆ and the dollar sign $. The caret ˆ matches the start of a line. The pattern /ˆThe/ matches the word The only at the start of a line. Thus there are three uses of the caret ˆ: to match the start of a line, as a negation inside of square brackets, and just to mean a caret. (What are the contexts that allow Perl to know which function a given caret is supposed to have?) The dollar sign $ matches the end of a line. So the pattern ␣$ is a useful pattern for matching a space at the end of a line, and /ˆThe dog\.$/ matches a line that contains only the phrase The dog. (We have to use the backslash here since we want the . to mean "period" and not the wildcard.)
There are also two other anchors: \b matches a word boundary, while \B matches a non-boundary. Thus /\bthe\b/ matches the word the but not the word other. More technically, Perl defines a word as any sequence of digits, underscores, or letters; this is based on the definition of "words" in programming languages like Perl or C. For example, /\b99\b/ will match the string 99 in There are 99 bottles of beer on the wall (because 99 follows a space) but not 99 in There are 299 bottles of beer on the wall (since 99 follows a number). But it will match 99 in $99 (since 99 follows a dollar sign ($), which is not a digit, underscore, or letter).
2.1.2 Disjunction, Grouping, and Precedence
Suppose we need to search for texts about pets; perhaps we are particularly interested in cats and dogs. In such a case we might want to search for either the string cat or the string dog. Since we can't use the square brackets to search for "cat or dog" (why not?), we need a new operator, the disjunction operator, also called the pipe symbol |.
DISJUNCTION
The pattern /cat|dog/ matches either the string cat or the string dog.
Sometimes we need to use this disjunction operator in the midst of a larger sequence. For example, suppose I want to search for information about pet fish for my cousin David. How can I specify both guppy and guppies? We cannot simply say /guppy|ies/, because that would match only the strings guppy and ies. This is because sequences like guppy take precedence over the disjunction operator |.
PRECEDENCE
In order to make the disjunction operator apply only to a specific pattern, we need to use the parenthesis operators ( and ). Enclosing a pattern in parentheses makes it act like a single character for the purposes of neighboring operators like the pipe | and the Kleene *. So the pattern /gupp(y|ies)/ would specify that we meant the disjunction only to apply to the suffixes y and ies.
The parenthesis operator ( is also useful when we are using counters like the Kleene *. Unlike the | operator, the Kleene * operator applies by default only to a single character, not a whole sequence. Suppose we want to match repeated instances of a string. Perhaps we have a line that has column labels of the form Column 1 Column 2 Column 3. The expression /Column [0-9]+ */ will not match any number of columns; instead, it will match a single column followed by any number of spaces! The star here applies only to the space that precedes it, not the whole sequence. With the parentheses, we could write the expression /(Column [0-9]+ *)*/ to match the word Column, followed by a number and optional spaces, the whole pattern repeated any number of times.
This idea that one operator may take precedence over another, requiring us to sometimes use parentheses to specify what we mean, is formalized by the operator precedence hierarchy for regular expressions. The following table gives the order of RE operator precedence, from highest to lowest:

Parenthesis              ( )
Counters                 * + ? {}
Sequences and anchors    the ˆmy end$
Disjunction              |

Because counters have a higher precedence than sequences, /the*/ matches theeeee but not thethe. Because sequences have a higher precedence than disjunction, /the|any/ matches the or any but not theny.
Patterns can be ambiguous in another way. Consider the expression /[a-z]*/ when matching against the text once upon a time. Since /[a-z]*/ matches zero or more letters, this expression could match nothing, or just the first letter o, or on, or onc, or once. In these cases regular expressions always match the largest string they can;
we say that patterns are greedy, expanding to cover as much of a string as they can.
GREEDY
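Greediness is easy to see by running a pattern in code. The following Python fragment (an added illustration, not from the original text) shows /[a-z]*/ grabbing the longest match it can at the start of the string; the non-greedy variant is included only for contrast:

```python
import re

text = "once upon a time"

# [a-z]* could in principle match the empty string, "o", "on", "onc", or "once";
# a greedy match takes the largest string available at the current position.
print(re.match(r"[a-z]*", text).group())          # once

# Python's re also offers a non-greedy variant *?, which takes the smallest match.
print(repr(re.match(r"[a-z]*?", text).group()))   # ''
```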
2.1.3 A Simple Example
Suppose we wanted to write a RE to find cases of the English article the. A simple (but incorrect) pattern might be:
/the/
One problem is that this pattern will miss the word when it begins a sentence and hence is capitalized (i.e., The). This might lead us to the following pattern:
/[tT]he/
But we will still incorrectly return texts with the embedded in other words (e.g., other or theology). So we need to specify that we want instances with a word boundary on both sides:
/\b[tT]he\b/
Suppose we wanted to do this without the use of /\b/? We might want this since /\b/ won't treat underscores and numbers as word boundaries; but we might want to find the in some context where it might also have underlines or numbers nearby (the_ or the25). We need to specify that we want instances in which there are no alphabetic letters on either side of the the:
/[ˆa-zA-Z][tT]he[ˆa-zA-Z]/
But there is still one more problem with this pattern: it won't find the word the when it begins a line. This is because the regular expression [ˆa-zA-Z], which we used to avoid embedded thes, implies that there must be some single (although non-alphabetic) character before the the. We can avoid this by specifying that before the the we require either the beginning-of-line or a non-alphabetic character, and the same at the end of the line:
/(ˆ|[ˆa-zA-Z])[tT]he([ˆa-zA-Z]|$)/
The process we just went through was based on fixing two kinds of errors: false positives, strings that we incorrectly matched like other or there, and false negatives, strings that we incorrectly missed, like The.
FALSE POSITIVES
FALSE NEGATIVES
Addressing these two kinds of errors comes up again and again in building and improving speech and language processing systems. Reducing the error rate for an application thus involves two antagonistic efforts:
• Increasing accuracy (minimizing false positives)
• Increasing coverage (minimizing false negatives).
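As a rough check of this tradeoff, the following Python sketch (added here as an illustration; the test lines are made up) runs the successive patterns from this section over a few strings and shows how the false positives and false negatives change:

```python
import re

lines = ["The cat sat.", "the dog ran", "in the other room", "see them there"]

patterns = [
    r"the",                                # false negative on "The"; false positives on them/there/other
    r"[tT]he",                             # fixes the capitalization miss, keeps the false positives
    r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)",   # requires non-letters (or line edges) on both sides
]

for pat in patterns:
    hits = [line for line in lines if re.search(pat, line)]
    print(pat, "->", hits)
# The last pattern matches the first three lines, which really do contain the word the/The,
# and rejects "see them there".
```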
2.1.4 A More Complex Example
Let's try out a more significant example of the power of REs. Suppose we want to build an application to help a user buy a computer on the Web. The user might want "any PC with more than 500 MHz and 32 Gb of disk space for less than $1000". In order to do this kind of retrieval we will first need to be able to look for expressions like 500 MHz or 32 Gb or Compaq or Mac or $999.99. In the rest of this section we'll work out some simple regular expressions for this task.
First, let's complete our regular expression for prices. Here's a regular expression for a dollar sign followed by a string of digits. Note that Perl is smart enough to realize that $ here doesn't mean end-of-line; how might it know that?
/$[0-9]+/
Now we just need to deal with fractions of dollars. We'll add a decimal point and two digits afterwards:
/$[0-9]+\.[0-9][0-9]/
This pattern only allows $199.99 but not $199. We need to make the cents optional, and make sure we're at a word boundary:
/\b$[0-9]+(\.[0-9][0-9])?\b/
How about specifications for processor speed (in megahertz = MHz or gigahertz = GHz)? Here's a pattern for that:
/\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/
Note that we use / */ to mean "zero or more spaces", since there might always be extra spaces lying around. Dealing with disk space (in Gb = gigabytes), or memory size (in Mb = megabytes or Gb = gigabytes), we need to allow for optional gigabyte fractions again (5.5 Gb). Note the use of ? for making the final s optional:
/\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
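Here is a small Python check of these patterns (an added illustration, not from the original text; the product description is invented, and the disk-size pattern is the variant just given). One detail: in Python's re module \b only fires next to a letter, digit, or underscore, so the leading \b of the price pattern is dropped before the escaped dollar sign:

```python
import re

ad = "Intel PC with 500 MHz processor, 2 gigahertz bus, 80.5 Gb disk, only $999.99 or $1000"

# \b before "$" would never match in Python, since "$" is not a word character.
price = re.compile(r"\$[0-9]+(\.[0-9][0-9])?\b")
speed = re.compile(r"\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b")
disk  = re.compile(r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b")

print([m.group() for m in price.finditer(ad)])  # ['$999.99', '$1000']
print([m.group() for m in speed.finditer(ad)])  # ['500 MHz', '2 gigahertz']
print([m.group() for m in disk.finditer(ad)])   # ['80.5 Gb']
```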
2.1.5 Advanced Operators

There are also some useful advanced regular expression operators. Fig. 2.6 shows some useful aliases for common ranges, which can be used mainly to save typing.

RE    Expansion        Match                              Example Patterns
\d    [0-9]            any digit
\D    [ˆ0-9]           any non-digit
\w    [a-zA-Z0-9_]     any alphanumeric or underscore     Daiyu
\W    [ˆ\w]            a non-alphanumeric
\s    whitespace       (space, tab)
\S    [ˆ\s]            non-whitespace
Figure 2.6   Aliases for common sets of characters.

Besides the Kleene * and Kleene +, we can also use explicit numbers as counters, by enclosing them in curly brackets. The regular expression /{3}/ means "exactly 3 occurrences of the previous character or expression". So /a\.{24}z/ will match a followed by 24 dots followed by z (but not a followed by 23 or 25 dots followed by a z).
A range of numbers can also be specified; so /{n,m}/ specifies from n to m occurrences of the previous char or expression, while /{n,}/ means at least n occurrences of the previous expression. REs for counting are summarized in Fig. 2.7.
RE       Match
*        zero or more occurrences of the previous char or expression
+        one or more occurrences of the previous char or expression
?        exactly zero or one occurrence of the previous char or expression
{n}      n occurrences of the previous char or expression
{n,m}    from n to m occurrences of the previous char or expression
{n,}     at least n occurrences of the previous char or expression
Figure 2.7   Regular expression operators for counting.
Finally, certain special characters are referred to by special notation based on the backslash (\). The most common of these are the newline character \n and the tab character \t; to refer to characters that are special themselves (like . and ?), we precede them with a backslash. Fig. 2.8 shows some examples.

RE    Match              Example Patterns Matched
\.    a period "."       "Dr. Livingston, I presume"
\?    a question mark    "Why don't they come and lend a hand?"
\n    a newline
\t    a tab
Figure 2.8   Some characters that need to be backslashed.
The reader should consult Appendix A for further details of regular expressions, and especially for the differences between regular expressions in Perl, UNIX, and Microsoft Word.
2.1.6 Regular Expression Substitution, Memory, and ELIZA
SUBSTITUTION
An important use of regular expressions is in substitutions. For example, the Perl substitution operator s/regexp1/pattern/ allows a string characterized by a regular expression to be replaced by another string:
s/colour/color/
It is often useful to be able to refer to a particular subpart of the string matching the first pattern. For example, suppose we wanted to put angle brackets around all integers in a text, changing e.g., the 35 boxes to the <35> boxes. We'd like a way to refer back to the integer we've found so that we can easily add the brackets. To do this, we put parentheses ( and ) around the first pattern, and use the number operator \1 in the second pattern to refer back. Here's how it looks:
s/([0-9]+)/<\1>/
The parenthesis and number operators can also be used to specify that a certain string or expression must occur twice in the text. For example, suppose we are looking for the pattern "the Xer they were, the Xer they will be", where we want to constrain the two X's to be the same string. We do this by surrounding the first X with the parenthesis operator, and replacing the second X with the number operator \1, as follows:
/the (.*)er they were, the \1er they will be/
Here the \1 will be replaced by whatever string matched the first item in parentheses. So this will match The bigger they were, the bigger they will be but not The bigger they were, the faster they will be.
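Capture groups and back-references work the same way in most regex engines. A small Python illustration (added here, not from the original text; the example strings come from the text, lowercased so the lowercase pattern applies):

```python
import re

# \1 refers back to whatever the first parenthesized group matched.
pattern = re.compile(r"the (.*)er they were, the \1er they will be")

print(bool(pattern.search("The bigger they were, the bigger they will be".lower())))  # True
print(bool(pattern.search("The bigger they were, the faster they will be".lower())))  # False

# In a substitution, \1 copies the matched group into the replacement.
print(re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes"))   # the <35> boxes
```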
The number operator can be used with other numbers: if you match two different sets of parentheses, \2 means whatever matched the second set. For example,
/the (.*)er they (.*), the \1er they \2/
will match The bigger they were, the bigger they were but not The bigger they were, the bigger they will be. These numbered memories are called registers (e.g. register 1,
REGISTERS
register 2, register 3, etc.). This memory feature is not part of every regular expression language and is often considered an "extended" feature of regular expressions.
Substitutions using memory are very useful in implementing a simple natural-language understanding program like ELIZA (Weizenbaum, 1966). Recall that ELIZA simulated a Rogerian psychologist and could carry on conversations with the user like the following:
User1: Men are all alike.
ELIZA1: IN WHAT WAY
User2: They're always bugging us about something or other.
ELIZA2: CAN YOU THINK OF A SPECIFIC EXAMPLE
User3: Well, my boyfriend made me come here.
ELIZA3: YOUR BOYFRIEND MADE YOU COME HERE
User4: He says I'm depressed much of the time.
ELIZA4: I AM SORRY TO HEAR YOU ARE DEPRESSED
Eliza worked by having a cascade of regular expression substitutions that each matched some part of the input lines and changed them. The first substitutions changed all instances of my to YOUR, and I'm to YOU ARE, and so on. The next set of substitutions looked for relevant patterns in the input and created an appropriate output; here are some examples:
s/.* YOU ARE (depressed|sad) */I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) */WHY DO YOU THINK YOU ARE \1/
s/.* all */IN WHAT WAY/
s/.* always */CAN YOU THINK OF A SPECIFIC EXAMPLE/
Since multiple substitutions could apply to a given input, substitutions were assigned a rank and were applied in order. Creation of such patterns is addressed in Exercise 2.2.
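A cascade of this kind is only a few lines of code. The sketch below is an illustrative Python reconstruction added here (it is neither Weizenbaum's program nor the book's code); it uppercases the input, applies a tiny subset of viewpoint swaps, then tries ranked response patterns in order, with a made-up default reply when nothing matches:

```python
import re

# Stage 1: simple viewpoint swaps (a small illustrative subset).
SWAPS = [
    (r"\bI'?m\b", "YOU ARE"),
    (r"\bmy\b", "YOUR"),
    (r"\bme\b", "YOU"),
]

# Stage 2: ranked response patterns; the first one that matches produces the reply.
RESPONSES = [
    (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* all .*", "IN WHAT WAY"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def eliza_reply(line: str) -> str:
    text = line.upper()
    for pat, repl in SWAPS:
        text = re.sub(pat, repl, text, flags=re.IGNORECASE)
    for pat, repl in RESPONSES:
        if re.match(pat, text, flags=re.IGNORECASE):
            return re.sub(pat, repl, text, flags=re.IGNORECASE)
    return "PLEASE GO ON"   # fallback, invented for this sketch

print(eliza_reply("He says I'm depressed much of the time."))  # I AM SORRY TO HEAR YOU ARE DEPRESSED
print(eliza_reply("Men are all alike."))                        # IN WHAT WAY
```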
2.2 Finite-State Automata

The regular expression is more than just a convenient metalanguage for text searching. First, a regular expression is one way of describing a finite-state automaton (FSA).
FINITE-STATE AUTOMATON
FSA
Finite-state automata are the theoretical foundation of a good deal of the computational work we will describe in this book. Any regular expression can be implemented as a finite-state automaton (except regular expressions that use the memory feature; more on this later). Symmetrically, any finite-state automaton can be described with a regular expression. Second, a regular expression is one way of characterizing a particular kind of formal language called a regular language. Both regular expressions and
REGULAR LANGUAGE
finite-state automata can be used to describe regular languages. A third equivalent method of characterizing the regular languages, the regular grammar, will be introduced in Ch. 15. The relation among these four theoretical constructions is sketched out in Fig. 2.9.
Figure 2.9   Finite automata, regular expressions, and regular grammars are all equivalent ways of describing regular languages.
This section will begin by introducing finite-state automata for some of the regular expressions from the last section, and then suggest how the mapping from regular expressions to automata proceeds in general. Although we begin with their use for implementing regular expressions, FSAs have a wide variety of other uses that we will explore in this chapter and the next.
regu-2.2.1 Using an FSA to Recognize Sheeptalk
After a while, with the parrot’s help, the Doctor got to learn the language of the animals so well that he could talk to them himself and understand everything they said.
Hugh Lofting, The Story of Doctor Dolittle
Let's begin with the "sheep language" we discussed previously. Recall that we defined the sheep language as any string from the following (infinite) set:

baa!
baaa!
baaaa!
baaaaa!
...

Figure 2.10   A finite-state automaton for talking sheep.
The regular expression for this kind of "sheeptalk" is /baa+!/. Fig. 2.10 shows an automaton for modeling this regular expression. The automaton (i.e., machine,
AUTOMATON
also called finite automaton, finite-state automaton, or FSA) recognizes a set of strings, in this case the strings characterizing sheep talk, in the same way that a regular expression does. We represent the automaton as a directed graph: a finite set of vertices (also called nodes), together with a set of directed links between pairs of vertices called arcs. We'll represent vertices with circles and arcs with arrows. The automaton has five
STATES
START STATE
states, which are represented by nodes in the graph. State 0 is the start state. In our examples state 0 will generally be the start state; to mark another state as the start state we can add an incoming arrow to the start state. State 4 is the final state or accepting state, which we represent by the double circle. It also has four transitions, which we represent by arcs in the graph.
The FSA can be used for recognizing (we also say accepting) strings in the following way. First, think of the input as being written on a long tape broken up into cells, with one symbol written in each cell of the tape, as in Fig. 2.11.
Figure 2.11   A tape with cells.
The machine starts in the start state (q0), and iterates the following process: Check the next letter of the input. If it matches the symbol on an arc leaving the current state, then cross that arc, move to the next state, and also advance one symbol in the input. If we are in the accepting state (q4) when we run out of input, the machine has successfully recognized an instance of sheeptalk. If the machine never gets to the final state, either because it runs out of input, or it gets some input that doesn't match an arc (as in Fig. 2.11), or if it just happens to get stuck in some non-final state, we say the machine rejects or fails to accept an input.
         Input
State    b   a   !
0        1   ∅   ∅
1        ∅   2   ∅
2        ∅   3   ∅
3        ∅   3   4
4:       ∅   ∅   ∅
Figure 2.12   The state-transition table for the FSA of Figure 2.10.
We've marked state 4 with a colon to indicate that it's a final state (you can have as many final states as you want), and the ∅ indicates an illegal or missing transition. We can read the first row as "if we're in state 0 and we see the input b we must go to state 1. If we're in state 0 and we see the input a or !, we fail".
More formally, a finite automaton is defined by the following five parameters:

Q = q0, q1, ..., qN−1    a finite set of N states
Σ                        a finite input alphabet of symbols
q0                       the start state
F                        the set of final states, F ⊆ Q
δ(q, i)                  the transition function or transition matrix between states. Given a state q ∈ Q and an input symbol i ∈ Σ, δ(q, i) returns a new state q′ ∈ Q. δ is thus a relation from Q × Σ to Q.

For the sheeptalk automaton in Fig. 2.10, Q = {q0, q1, q2, q3, q4}, Σ = {a, b, !}, F = {q4}, and δ(q, i) is defined by the transition table in Fig. 2.12.
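The five parameters of the formal definition translate directly into data. Here is a minimal Python sketch of the sheeptalk automaton (an illustration added here; the variable names are my own, not the book's):

```python
# The sheeptalk FSA /baa+!/ as explicit data structures.
Q     = {0, 1, 2, 3, 4}          # states
SIGMA = {"a", "b", "!"}          # input alphabet
START = 0                        # start state q0
FINAL = {4}                      # accepting states F
DELTA = {                        # transition function delta(q, i)
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,                 # the self-loop that allows baaa!, baaaa!, ...
    (3, "!"): 4,
}

print(DELTA.get((0, "b")))       # 1; a missing key plays the role of an empty cell
```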
Figure 2.13 presents an algorithm for recognizing a string using a state-transition table. The algorithm is called D-RECOGNIZE for "deterministic recognizer". A deterministic algorithm is one that has no choice points; the algorithm always knows what
DETERMINISTIC
to do for any input. The next section will introduce non-deterministic automata that must make decisions about which states to move to.
D-RECOGNIZE takes as input a tape and an automaton. It returns accept if the string it is pointing to on the tape is accepted by the automaton, and reject otherwise. Note that since D-RECOGNIZE assumes it is already pointing at the string to be checked, its task is only a subpart of the general problem that we often use regular expressions for,
finding a string in a corpus. (The general problem is left as an exercise to the reader in Exercise 2.9.)
D-RECOGNIZE begins by setting the variable index to the beginning of the tape, and current-state to the machine's initial state. D-RECOGNIZE then enters a loop that drives the rest of the algorithm. It first checks whether it has reached the end of its input. If so, it either accepts the input (if the current state is an accept state) or rejects the input (if not).
If there is input left on the tape, D-RECOGNIZE looks at the transition table to decide which state to move to. The variable current-state indicates which row of the table to consult, while the current symbol on the tape indicates which column of the table to consult. The resulting transition-table cell is used to update the variable current-state and index is incremented to move forward on the tape. If the transition-table cell is empty then the machine has nowhere to go and must reject the input.
function D-RECOGNIZE(tape, machine) returns accept or reject

  index ← Beginning of tape
  current-state ← Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then return accept
      else return reject
    elsif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state ← transition-table[current-state, tape[index]]
      index ← index + 1
  end

Figure 2.13   An algorithm for deterministic recognition of FSAs. This algorithm returns accept if the entire string it is pointing at is in the language defined by the FSA, and reject if the string is not in the language.
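The pseudocode of Fig. 2.13 translates almost line by line into Python. The following sketch is an added illustration (not the book's code), using a dictionary as the transition table, with absent keys playing the role of the empty ∅ cells:

```python
def d_recognize(tape, delta, start, final):
    """Deterministic FSA recognition: return True iff the whole tape is accepted."""
    index = 0
    current_state = start
    while True:
        if index == len(tape):                        # end of input reached
            return current_state in final
        if (current_state, tape[index]) not in delta:
            return False                              # empty transition-table cell
        current_state = delta[(current_state, tape[index])]
        index += 1

# Sheeptalk again: /baa+!/
DELTA = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}
print(d_recognize("baaa!", DELTA, 0, {4}))   # True
print(d_recognize("abc",   DELTA, 0, {4}))   # False
```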
Figure 2.14 traces the execution of this algorithm on the sheep language FSA given
the sample input string baaa!.
Figure 2.14   Tracing the execution of FSA #1 on some sheeptalk; the machine moves through the states q0 q1 q2 q3 q3 q4.
Before examining the beginning of the tape, the machine is in state q0. Finding a b on the input tape, it changes to state q1 as indicated by the contents of transition-table[q0,b] in Fig. 2.12. It then finds an a and switches to state q2, another a puts it in state q3, a third a leaves it in state q3, where it reads the "!", and switches to state q4. Since there is no more input, the End of input condition at the beginning of the loop is satisfied for the first time and the machine halts in q4. State q4 is an accepting state, and so the machine has accepted the string baaa! as a sentence in the sheep language.
The algorithm will fail whenever there is no legal transition for a given combination of state and input. The input abc will fail to be recognized since there is no legal transition out of state q0 on the input a (i.e., this entry of the transition table in Fig. 2.12 has a ∅). Even if the automaton had allowed an initial a it would have certainly failed on c, since c isn't even in the sheeptalk alphabet! We can think of these "empty" elements in the table as if they all pointed at one "empty" state, which we might call the fail state or sink state. In a sense then, we could view any machine with
FAIL STATE
empty transitions as if we had augmented it with a fail state, and drawn in all the extra arcs, so we always had somewhere to go from any state on any possible input. Just for completeness, Fig. 2.15 shows the FSA from Figure 2.10 with the fail state qF filled in.
Figure 2.15   Adding a fail state to Fig. 2.10.
2.2.2 Formal Languages
We can use the same graph in Fig. 2.10 as an automaton for GENERATING sheeptalk. If we do, we would say that the automaton starts at state q0, and crosses arcs to new states, printing out the symbols that label each arc it follows. When the automaton gets to the final state it stops. Notice that at state 3, the automaton has to choose between printing out a ! and going to state 4, or printing out an a and returning to state 3. Let's say for now that we don't care how the machine makes this decision; maybe it flips a coin. For now, we don't care which exact string of sheeptalk we generate, as long as it's a string captured by the regular expression for sheeptalk above.

Formal Language: A model which can both generate and recognize all and only the strings of a formal language acts as a definition of the formal language.
A formal language is a set of strings, each string composed of symbols from a
FORMAL LANGUAGE
finite symbol-set called an alphabet (the same alphabet used above for defining an
ALPHABET
automaton!). The alphabet for the sheep language is the set Σ = {a, b, !}. Given a model m (such as a particular FSA), we can use L(m) to mean "the formal language characterized by m". So the formal language defined by our sheeptalk automaton m in Fig. 2.10 (and Fig. 2.12) is the infinite set:
L(m) = {baa!, baaa!, baaaa!, baaaaa!, baaaaaa!, ...}
Formal languages are not the same thing as natural languages, the languages that people actually speak; but we often use a formal language to model part of a natural language, such as parts of the phonology, morphology, or syntax. The term generative grammar is sometimes used in linguistics to mean a grammar of a formal language; the origin of the term is this use of an automaton to define a language by generating all possible strings.
2.2.3 Another Example
In the previous examples our formal alphabet consisted of letters; but we can also have a higher-level alphabet consisting of words. In this way we can write finite-state automata that model facts about word combinations. For example, suppose we wanted to build an FSA that modeled the subpart of English dealing with amounts of money. Such a formal language would model the subset of English consisting of phrases like ten cents, three dollars, one dollar thirty-five cents and so on.
We might break this down by first building just the automaton to account for the numbers from 1 to 99, since we'll need them to deal with cents. Fig. 2.16 shows this.
Figure 2.16   An FSA for the words for English numbers 1–99.
We could now add cents and dollars to our automaton. Fig. 2.17 shows a simple version of this, where we just made two copies of the automaton in Fig. 2.16 and appended the words cents and dollars.
Figure 2.17   FSA for the simple dollars and cents.
We would now need to add in the grammar for different amounts of dollars, including higher numbers like hundred, thousand. We'd also need to make sure that the nouns like cents and dollars are singular when appropriate (one cent, one dollar), and plural when appropriate (ten cents, two dollars). This is left as an exercise for the reader (Exercise 2.3). We can think of the FSAs in Fig. 2.16 and Fig. 2.17 as simple grammars of parts of English. We will return to grammar-building in Part II of this book, particularly in Ch. 12.
2.2.4 Non-Deterministic FSAs

Let's extend our discussion now to another class of FSAs: non-deterministic FSAs (or NFSAs). Consider the sheeptalk automaton in Figure 2.18, which is much like our first automaton in Figure 2.10; the only difference is that in the NFSA of Fig. 2.18 (NFSA #1) the self-loop on a is on state 2 instead of state 3. Consider using this network as an automaton for recognizing sheeptalk. When we get to state 2, if we see an a we don't know whether to remain in state 2 or go on to state 3. Automata with decision points like this are called non-deterministic FSAs (or NFSAs). Recall by contrast
NON-DETERMINISTIC
NFSA
that Figure 2.10 specified a deterministic automaton, i.e., one whose behavior during recognition is fully determined by the state it is in and the symbol it is looking at. A deterministic automaton can be referred to as a DFSA. That is not true for the machine in Fig. 2.18.
Another kind of non-determinism is caused by arcs that have no symbols on them, called ε-transitions. The automaton in Fig. 2.19 defines the exact same language as the last one, or our first one, but it does it with an ε-transition.
Figure 2.19   Another NFSA for the sheep language (NFSA #2). It differs from NFSA #1 in Fig. 2.18 in having an ε-transition.
We interpret this new arc as follows: If we are in state 3, we are allowed to move to state 2 without looking at the input, or advancing our input pointer. So this introduces another kind of non-determinism: we might not know whether to follow the ε-transition or the ! arc.
2.2.5 Using an NFSA to Accept Strings
If we want to know whether a string is an instance of sheeptalk or not, and if we use a non-deterministic machine to recognize it, we might follow the wrong arc and reject it when we should have accepted it. That is, since there is more than one choice at some point, we might take the wrong choice. This problem of choice in non-deterministic models will come up again and again as we build computational models, particularly for parsing. There are three standard solutions to the problem of non-determinism:

• Backup: Whenever we come to a choice point, we could put a marker to mark where we are in the input and what state the automaton is in. Then if it turns out that we took the wrong choice, we could back up and try another path.
• Look-ahead: We could look ahead in the input to help us decide which path to take.
• Parallelism: Whenever we come to a choice point, we could look at every alternative path in parallel.

We will focus here on the backup approach and defer discussion of the look-ahead and parallelism approaches to later chapters.
The backup approach suggests that we should blithely make choices that might lead to dead ends, knowing that we can always return to unexplored alternative choices. There are two keys to this approach: we need to remember all the alternatives for each choice point, and we need to store sufficient information about each alternative so that we can return to it when necessary. When a backup algorithm reaches a point in its processing where no progress can be made (because it runs out of input, or has no legal transitions), it returns to a previous choice point, selects one of the unexplored alternatives, and continues from there. Applying this notion to our non-deterministic recognizer, we need only remember two things for each choice point: the state, or node, of the machine that we can go to and the corresponding position on the tape. We will call the combination of the node and position the search-state of the recognition algorithm. To avoid confusion, we will refer to the state of the automaton (as opposed to the state of the search) as a node or a machine-state. Figure 2.21 presents a recognition algorithm based on this approach.
Before going on to describe the main part of this algorithm, we should note two changes to the transition table that drives it. First, in order to represent nodes that have outgoing ε-transitions, we add a new ε-column to the transition table. If a node has an ε-transition, we list the destination node in the ε-column for that node's row. The second addition is needed to account for multiple transitions to different nodes from the same input symbol. We let each cell entry consist of a list of destination nodes rather than a single node. Fig. 2.20 shows the transition table for the machine in Figure 2.18 (NFSA #1). While it has no ε-transitions, it does show that in machine-state q2 the input a can lead back to q2 or on to q3.

         Input
State    b    a      !    ε
0        1    ∅      ∅    ∅
1        ∅    2      ∅    ∅
2        ∅    2,3    ∅    ∅
3        ∅    ∅      4    ∅
4:       ∅    ∅      ∅    ∅
Figure 2.20   The transition table from NFSA #1 in Fig. 2.18.
Fig. 2.21 shows the algorithm for using a non-deterministic FSA to recognize an input string. The function ND-RECOGNIZE uses the variable agenda to keep track of all the currently unexplored choices generated during the course of processing. Each choice (search state) is a tuple consisting of a node (state) of the machine and a position on the tape. The variable current-search-state represents the branch choice being currently explored.
ND-RECOGNIZE begins by creating an initial search-state and placing it on the agenda. For now we don't specify what order the search-states are placed on the agenda. This search-state consists of the initial machine-state of the machine and a pointer to the beginning of the tape. The function NEXT is then called to retrieve an item from the agenda and assign it to the variable current-search-state.
As with D-RECOGNIZE, the first task of the main loop is to determine if the entire contents of the tape have been successfully recognized. This is done via a call to ACCEPT-STATE?, which returns accept if the current search-state contains both an accepting machine-state and a pointer to the end of the tape. If we're not done, the machine generates a set of possible next steps by calling GENERATE-NEW-STATES, which creates search-states for any ε-transitions and any normal input-symbol transitions from the transition table. All of these search-state tuples are then added to the current agenda.
Finally, we attempt to get a new search-state to process from the agenda. If the agenda is empty we've run out of options and have to reject the input. Otherwise, an unexplored option is selected and the loop continues.
It is important to understand why ND-RECOGNIZE returns a value of reject only when the agenda is found to be empty. Unlike D-RECOGNIZE, it does not return reject when it reaches the end of the tape in a non-accept machine-state or when it finds itself unable to advance the tape from some machine-state. This is because, in the non-deterministic case, such roadblocks only indicate failure down a given path, not overall failure. We can only be sure we can reject a string when all possible choices have been examined and found lacking.
function ND-RECOGNIZE(tape, machine) returns accept or reject

  agenda ← {(Initial state of machine, beginning of tape)}
  current-search-state ← NEXT(agenda)
  loop
    if ACCEPT-STATE?(current-search-state) returns true then
      return accept
    else
      agenda ← agenda ∪ GENERATE-NEW-STATES(current-search-state)
    if agenda is empty then
      return reject
    else
      current-search-state ← NEXT(agenda)
  end

function GENERATE-NEW-STATES(current-state) returns a set of search-states

  current-node ← the node the current search-state is in
  index ← the point on the tape the current search-state is looking at
  return a list of search states from transition table as follows:
    (transition-table[current-node, ε], index)
    ∪
    (transition-table[current-node, tape[index]], index + 1)

function ACCEPT-STATE?(search-state) returns true or false

  current-node ← the node search-state is in
  index ← the point on the tape search-state is looking at
  if index is at the end of the tape and current-node is an accept state of machine
    then return true
    else return false

Figure 2.21   An algorithm for NFSA recognition. The word node means a state of the FSA, while state or search-state means "the state of the search process", i.e., a combination of node and tape-position.
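The same agenda-driven search can be sketched in Python (an added illustration, not the book's code). Here the transition table maps a (node, symbol) pair to a list of destination nodes, the empty string stands in for ε, and the agenda is a simple stack of (node, tape-position) search-states; ε-cycles are assumed absent:

```python
def nd_recognize(tape, delta, start, final):
    """Non-deterministic FSA recognition with backup (depth-first agenda)."""
    agenda = [(start, 0)]                    # search-states: (machine node, tape position)
    while agenda:
        node, index = agenda.pop()           # NEXT: take an unexplored search-state
        if index == len(tape) and node in final:
            return True                      # ACCEPT-STATE?
        # GENERATE-NEW-STATES: epsilon moves plus moves on the current input symbol.
        for nxt in delta.get((node, ""), []):
            agenda.append((nxt, index))
        if index < len(tape):
            for nxt in delta.get((node, tape[index]), []):
                agenda.append((nxt, index + 1))
    return False                             # agenda exhausted: reject

# NFSA #1: the self-loop on a is on state 2 rather than state 3.
DELTA = {(0, "b"): [1], (1, "a"): [2], (2, "a"): [2, 3], (3, "!"): [4]}
print(nd_recognize("baaa!", DELTA, 0, {4}))   # True
print(nd_recognize("baa",   DELTA, 0, {4}))   # False
```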
Figure 2.22 illustrates the progress of ND-RECOGNIZE as it attempts to handle the input baaa!. Each strip illustrates the state of the algorithm at a given point in its processing. The current-search-state variable is captured by the solid bubbles representing the machine-state, along with the arrow representing progress on the tape. Each strip lower down in the figure represents progress from one current-search-state to the next.
Little of interest happens until the algorithm finds itself in state q2 while looking at the second a on the tape. An examination of the entry for transition-table[q2,a] returns both q2 and q3. Search states are created for each of these choices and placed on the agenda. Unfortunately, our algorithm chooses to move to state q3, a move that results in neither an accept state nor any new states since the entry for transition-table[q3,a] is empty. At this point, the algorithm simply asks the agenda for a new state to pursue. Since the choice of returning to q2 from q2 is the only unexamined choice on the agenda, it is returned with the tape pointer advanced to the next a. Somewhat diabolically, ND-RECOGNIZE finds itself faced with the same choice. The entry for transition-table[q2,a] still indicates that looping back to q2 or advancing to q3 are valid choices. As before, states representing both are placed on the agenda. These search states are not the same as the previous ones since their tape index values have advanced. This time the agenda provides the move to q3 as the next move. The move to q4, and success, is then uniquely determined by the tape and the transition-table.