Foundations of Statistical Natural Language Processing
Second printing, 1999
© 1999 Massachusetts Institute of Technology
Second printing with corrections, 2000
All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher. Typeset in 10/13 Lucida Bright by the authors using LaTeX2e.
Printed and bound in the United States of America.
Library of Congress Cataloging-in-Publication Information
6 Statistical Inference: n-gram Models over Sparse Data 191
7 Word Sense Disambiguation 229
IV Applications and Techniques 461
13 Statistical Alignment and Machine Translation 463
14 Clustering 495
15 Topics in Information Retrieval 529
16 Text Categorization 575
1.2.3 Language and cognition as probabilistic phenomena 15
1.3 The Ambiguity of Language: Why NLP Is Difficult 17
1.4 Dirty Hands 19
1.6 Exercises 35
2 Mathematical Foundations 39
2.1 Elementary Probability Theory 40
2.1.1 Probability spaces 40
2.1.2 Conditional probability and independence
2.1.3 Bayes' theorem 43
2.1.4 Random variables 45
2.1.5 Expectation and variance 46
2.1.6 Notation 47
2.1.7 Joint and conditional distributions 48
2.1.8 Determining P 48
2.1.9 Standard distributions 50
2.1.10 Bayesian statistics 54
2.1.11 Exercises 59
2.2.8 Perplexity 78
2.2.9 Exercises 78
2.3 Further Reading 79
3 Linguistic Essentials 81
3.1 Parts of Speech and Morphology 81
3.1.1 Nouns and pronouns 83
3.1.2 Words that accompany nouns: Determiners and adjectives 87
3.1.3 Verbs 88
3.1.4 Other parts of speech 91
3.2 Phrase Structure 93
3.2.1 Phrase structure grammars 96
3.2.2 Dependency: Arguments and adjuncts 101
3.2.3 X' theory 106
3.2.4 Phrase structure ambiguity 107
4.2.1 Low-level formatting issues 123
4.2.2 Tokenization: What is a word? 124
5.3.3 Pearson’s chi-square test 169
6.1.3 Building n-gram models 195
6.2 Statistical Estimators 196
6.2.1 Maximum Likelihood Estimation (MLE) 197
6.2.2 Laplace's law, Lidstone's law and the Jeffreys-Perks law 202
6.2.3 Held out estimation 205
6.2.4 Cross-validation (deleted estimation) 210
7.1.1 Supervised and unsupervised learning 232
7.1.2 Pseudowords 233
7.1.3 Upper and lower bounds on performance 233
7.3.4 One sense per discourse, one sense per collocation 249
7.4 Unsupervised Disambiguation 252
7.5 What Is a Word Sense? 256
7.6 Further Reading 260
7.7 Exercises 262
8.3.1 Hindle and Rooth (1993) 280
8.3.2 General remarks on PP attachment 284
9.3 The Three Fundamental Questions for HMMs 325
9.3.1 Finding the probability of an observation 326
9.3.2 Finding the best state sequence 331
9.3.3 The third problem: Parameter estimation 333
9.4 HMMs: Implementation, Properties, and Variants 336
9.4.1 Implementation 336
9.4.2 Variants 337
9.4.3 Multiple input observations 338
9.4.4 Initialization of parameter values 339
9.5 Further Reading 339
10 Part-of-Speech Tagging 341
10.1 The Information Sources in Tagging 343
10.2 Markov Model Taggers 345
10.2.1 The probabilistic model 345
10.2.2 The Viterbi algorithm 349
10.2.3 Variations 351
10.3 Hidden Markov Model Taggers 356
10.3.1 Applying HMMs to POS tagging 357
10.3.2 The effect of initialization on HMM training
10.4 Transformation-Based Learning of Tags 361
10.4.1 Transformations 362
10.4.2 The learning algorithm 364
10.4.3 Relation to other models 365
10.4.4 Automata 367
10.4.5 Summary 369
10.5 Other Methods, Other Languages 370
10.5.1 Other approaches to tagging 370
10.5.2 Languages other than English 371
10.6 Tagging Accuracy and Uses of Taggers 371
10.6.1 Tagging accuracy 371
10.6.2 Applications of tagging 374
10.7 Further Reading 377
11.4 Problems with the Inside-Outside Algorithm 401
11.5 Further Reading 402
12.1.4 Weakening the independence assumptions of PCFGs 416
12.1.5 Tree probabilities and derivational probabilities 421
12.1.6 There's more than one way to do it 423
12.1.7 Phrase structure grammars and dependency grammars 428
12.1.8 Evaluation 431
12.1.9 Equivalent models 437
12.1.10 Building parsers: Search methods 439
12.1.11 Use of the geometric mean 442
12.2 Some Approaches 443
12.2.1 Non-lexicalized treebank grammars 443
12.2.2 Lexicalized models using derivational histories 448
12.2.3 Dependency-based models 451
12.2.4 Discussion 454
12.3 Further Reading 456
12.4 Exercises 458
IV Applications and Techniques 461
13 Statistical Alignment and Machine Translation 463
14.1.1 Single-link and complete-link clustering 503
14.1.2 Group-average agglomerative clustering 507
14.1.3 An application: Improving a language model 509
14.1.4 Top-down clustering 512
14.2 Non-Hierarchical Clustering 514
14.2.1 K-means 515
14.2.2 The EM algorithm 518
14.3 Further Reading 527
15.3.4 Inverse document frequency 551
15.3.5 Residual inverse document frequency 553
15.3.6 Usage of term distribution models 554
15.4 Latent Semantic Indexing 554
15.4.1 Least-squares methods 557
15.4.2 Singular Value Decomposition 558
15.4.3 Latent Semantic Indexing in IR 564
15.5 Discourse Segmentation 566
15.5.1 TextTiling 567
15.6 Further Reading 570
15.7 Exercises 573
16 Text Categorization 575
16.1 Decision Trees 578
16.2 Maximum Entropy Modeling 589
16.2.1 Generalized iterative scaling 591
16.2.2 Application to text categorization 594
16.3 Perceptrons 597
16.4 k Nearest Neighbor Classification 604
16.5 Further Reading 607
Tiny Statistical Tables 609
Bibliography 611
Index 657
List of Tables
1.2 Frequency of frequencies of word types in Tom Sawyer. 22
1.3 Empirical evaluation of Zipf's law on Tom Sawyer. 24
1.4 Commonest bigram collocations in the New York Times. 30
2.1 Likelihood ratios between two theories 58
2.2 Statistical NLP problems as decoding problems 71
4.1 Major suppliers of electronic corpora with contact URLs 119
4.2 Different formats for telephone numbers appearing in an issue of The Economist. 131
4.3 Sentence lengths in newswire text 137
4.4 Sizes of various tag sets 140
4.5 Comparison of different tag sets: adjective, adverb, conjunction, determiner, noun, and pronoun tags 141
4.6 Comparison of different tag sets: Verb, preposition, punctuation and symbol tags 142
5.2 Part of speech tag patterns for collocation filtering 154
5.3 Finding Collocations: Justeson and Katz' part-of-speech filter 155
5.4 The nouns w occurring most often in the patterns 'strong w' and 'powerful w.'
5.5 Finding collocations based on mean and variance
5.6 Finding collocations: The t test applied to 10 bigrams that occur with frequency 20
5.7 Words that occur significantly more often with powerful (the first ten words) and strong (the last ten words).
5.8 A 2-by-2 table showing the dependence of occurrences of new and companies.
5.9 Correspondence of vache and cow in an aligned corpus
5.10 Testing for the independence of words in different corpora using χ²
5.11 How to compute Dunning's likelihood ratio test
5.12 Bigrams of powerful with the highest scores according to Dunning's likelihood ratio test
5.13 Damerau's frequency ratio test
5.14 Finding collocations: Ten bigrams that occur with frequency 20, ranked according to mutual information
5.15 Correspondence of chambre and house and communes and house in the aligned Hansard corpus.
5.16 Problems for Mutual Information from data sparseness
5.17 Different definitions of mutual information in (Cover and Thomas 1991) and (Fano 1961)
5.18 Collocations in the BBI Combinatory Dictionary of English for the words strength and power.
6.1 Growth in number of parameters for n-gram models
6.2 Notation for the statistical estimation chapter
6.3 Probabilities of each successive word for a clause from Persuasion
Notational conventions used in this chapter 235
Clues for two senses of drug used by a Bayesian classifier. 238
Highly informative indicators for three ambiguous French words
Disambiguation of ash with Lesk's algorithm 243
Some results of thesaurus-based disambiguation 247
How to disambiguate interest using a second-language corpus 248
Examples of the one sense per discourse constraint 250
Some results of unsupervised disambiguation 256
The F measure and accuracy are different objective functions 270
Some subcategorization frames with example verbs and
Some subcategorization frames learned by Manning's system 276
An example where the simple model for resolving PP
Selectional Preference Strength (SPS) 290
Association strength distinguishes a verb's plausible and
Similarity measures for binary vectors 299
The cosine as a measure of semantic similarity 302
Measures of (dis-)similarity between probability distributions 304
Types of words occurring in the LOB corpus that were not
Variable calculations for O = (lem, ice_t, cola) 330
Some part-of-speech tags frequently used for tagging English 342
Idealized counts of some tag transitions in the Brown Corpus 348
Idealized counts of tags that some words occur within the Brown Corpus 349
Table of probabilities for dealing with unknown words in tagging 352
Initialization of the parameters of an HMM 359
Triggering environments in Brill's transformation-based tagger 363
Examples of some transformations learned in transformation-based tagging 363
Examples of frequent errors of probabilistic taggers 374
10.10 A portion of a confusion matrix for part of speech tagging 375
11.1 Notation for the PCFG chapter 383
11.2 A simple Probabilistic Context Free Grammar (PCFG) 384
11.3 Calculation of inside probabilities 394
12.1 Abbreviations for phrasal categories in the Penn Treebank 413
12.2 Frequency of common subcategorization frames (local trees expanding VP) for selected verbs 418
12.3 Selected common expansions of NP as Subject vs Object, ordered by log odds ratio 420
12.4 Selected common expansions of NP as first and second object inside VP 420
12.5 Precision and recall evaluation results for PP attachment errors for different styles of phrase structure 436
12.6 Comparison of some statistical parsing systems 455
13.1 Sentence alignment papers 470
14.1 A summary of the attributes of different clustering algorithms 500
14.2 Symbols used in the clustering chapter 501
14.3 Similarity functions used in clustering 503
14.4 An example of K-means clustering 518
14.5 An example of a Gaussian mixture 521
15.1 An example of the evaluation of rankings 535
Components of tf.idf weighting schemes.
Document frequency (df) and collection frequency (cf) for 6 words in the New York Times corpus
Actual and estimated number of documents with k occurrences for six terms
Example for exploiting co-occurrence in computing content similarity
The matrix of document correlations BᵀB
16.1 Some examples of classification tasks in NLP. 576
16.2 Contingency table for evaluating a binary classifier 577
16.3 The representation of document 11, shown in figure 16.3 581
16.4 An example of information gain as a splitting criterion 582
16.5 Contingency table for a decision tree for the Reuters category "earnings." 586
16.6 An example of a maximum entropy distribution in the form of equation (16.4) 593
16.7 An empirical distribution whose corresponding maximum entropy distribution is the one in table 16.6 594
16.8 Feature weights in maximum entropy modeling for the category "earnings" in Reuters 595
16.9 Classification results for the distribution corresponding to table 16.8 on the test set 595
16.10 Perceptron for the "earnings" category 601
16.11 Classification results for the perceptron in table 16.10 on the test set 602
16.12 Classification results for an 1NN categorizer for the "earnings" category 606
List of Figures

1.1 Zipf's law 26
1.2 Mandelbrot's formula 27
1.3 Key Word In Context (KWIC) display for the word showed. 32
1.4 Syntactic frames for showed in Tom Sawyer. 33
2.1 A diagram illustrating the calculation of conditional probability P(A|B). 42
2.2 A random variable X for the sum of two dice 45
2.3 Two examples of binomial distributions: b(r; 10, 0.7) and b(r; 10, 0.1) 52
2.4 Example normal distribution curves: n(x; 0, 1) and n(x; 1.5, 2) 53
2.5 The entropy of a weighted coin 63
2.6 The relationship between mutual information I and entropy H 67
2.7 The noisy channel model. 69
2.8 A binary symmetric channel 69
2.9 The noisy channel model in linguistics 70
3.1 An example of recursive phrase structure expansion 99
3.2 An example of a prepositional phrase attachment ambiguity 108
4.1 Heuristic sentence boundary detection algorithm 135
4.2 A sentence as tagged according to several different tag sets. 140
5.1 Using a three word collocational window to capture bigrams at a distance 158
Adaptive thesaurus-based disambiguation.
Disambiguation based on a second-language corpus
Disambiguation based on "one sense per collocation" and "one sense per discourse."
7.8 An EM algorithm for learning a word sense clustering
8.1 A diagram motivating the measures of precision and recall 268
The crazy soft drink machine, showing the states of the machine and the state transition probabilities
A section of an HMM for a linearly interpolated language model
Trellis algorithms: Closeup of the computation of forward probabilities at one node
Algorithm for training a Visible Markov Model Tagger 348
Algorithm for tagging with a Visible Markov Model Tagger 350
The learning algorithm for transformation-based tagging 364
The two parse trees, their probabilities, and the sentence probability 385
A Probabilistic Regular Grammar (PRG) 390
Inside and outside probabilities in PCFGs 391
A Penn Treebank tree. 413
Two CFG derivations of the same tree. 421
An LC stack parser 425
Decomposing a local tree into dependencies 430
An example of the PARSEVAL measures 433
The idea of crossing brackets 434
Penn trees versus other trees 436
Different strategies for Machine Translation 464
Alignment and correspondence 469
Calculating the cost of alignments 473
A sample dot plot 476
The pillow-shaped envelope that is searched 480
The noisy channel model in machine translation 486
14.1 A single-link clustering of 22 frequent English words represented as a dendrogram 496
14.2 Bottom-up hierarchical clustering 502
14.3 Top-down hierarchical clustering 502
14.4 A cloud of points in a plane 504
14.5 Intermediate clustering of the points in figure 14.4 504
14.6 Single-link clustering of the points in figure 14.4 505
14.7 Complete-link clustering of the points in figure 14.4 505
14.8 The K-means clustering algorithm 516
14.9 One iteration of the K-means algorithm 517
14.10 An example of using the EM algorithm for soft clustering 519
15.1 Results of the search '"glass pyramid" Pei Louvre' on an internet search engine 531
15.2 Two examples of precision-recall curves 537
15.3 A vector space with two dimensions 540
15.4 The Poisson distribution 546
15.5 An example of a term-by-document matrix A. 555
15.6 Dimensionality reduction 555
15.7 An example of linear regression 558
15.8 The matrix T of the SVD decomposition of the matrix in figure 15.5 560
15.9 The matrix of singular values of the SVD decomposition of the matrix in figure 15.5 560
15.10 The matrix D of the SVD decomposition of the matrix in figure 15.5 561
15.11 The matrix B = S2×2D2×n of documents after rescaling with singular values and reduction to two dimensions 562
15.12 Three constellations of cohesion scores in topic boundary identification 569
16.1 A decision tree 578
16.2 Geometric interpretation of part of the tree in figure 16.1 579
16.3 An example of a Reuters news story in the topic category "earnings." 580
16.4 Pruning a decision tree 585
16.5 Classification accuracy depends on the amount of training data available 587
16.6 An example of how decision trees use data inefficiently from the domain of phonological rule learning 588
16.7 The Perceptron Learning Algorithm 598
16.8 One error-correcting step of the perceptron learning algorithm 600
16.9 Geometric interpretation of a perceptron 602
Table of Notations

The complement of set A
The empty set
The power set of A
Cardinality of a set
Sum
Product
p implies q (logical inference)
p and q are logically equivalent
Defined to be equal to (only used if "=" is ambiguous)
The set of real numbers
N The set of natural numbers
n! The factorial of n
|x| Absolute value of a number
<< Much smaller than
>> Much greater than
f: A → B A function f from values in A to B
max f The maximum value of f
min f The minimum value of f
arg max f The argument for which f has its maximum value
arg min f The argument for which f has its minimum value
lim x→∞ f(x) The limit of f as x tends to infinity
f is proportional to g
Partial derivative
Integral
The logarithm of a
The exponential function
The smallest integer i s.t. i ≥ a
A real-valued vector: x ∈ ℝⁿ
Euclidean length of x
The dot product of x and y
The cosine of the angle between x and y
Element in row i and column j of matrix C
Cᵀ Transpose of matrix C
X̂ Estimate of X
E(X) Expectation of X
Var(X) Variance of X
µ Mean
σ Standard deviation
x̄ Sample mean
s² Sample variance
P(A|B) The probability of A conditional on B
X ~ p(X) Random variable X is distributed according to p
b(r; n, p) The binomial distribution
Combination or binomial coefficient (the number of ways of choosing r objects from n)
n(x; µ, σ) The normal distribution
H(X) Entropy
Count of the entity in parentheses
The relative frequency of u
The words wi, wi+1, ..., wj
The same as wij
The same as wij
Time complexity of an algorithm
Ungrammatical sentence or phrase or ill-formed word
Marginally grammatical sentence or marginally acceptable phrase
Note: Some chapters have separate notation tables for symbols that are used locally: table 6.2 (Statistical Inference), table 7.1 (Word Sense Disambiguation), table 9.1 (Markov Models), table 10.2 (Tagging), table 11.1 (Probabilistic Context-Free Grammars), and table 14.2 (Clustering).
Preface

THE NEED for a thorough textbook for Statistical Natural Language Processing hardly needs to be argued for in the age of on-line information, electronic communication and the World Wide Web. Increasingly, businesses, government agencies and individuals are confronted with large amounts of text that are critical for working and living, but not well enough understood to get the enormous value out of them that they potentially hide.
At the same time, the availability of large text corpora has changed the scientific approach to language in linguistics and cognitive science. Phenomena that were not detectable or seemed uninteresting in studying toy domains and individual sentences have moved into the center field of what is considered important to explain. Whereas as recently as the early 1990s quantitative methods were seen as so inadequate for linguistics that an important textbook for mathematical linguistics did not cover them in any way, they are now increasingly seen as crucial for linguistic theory.
In this book we have tried to achieve a balance between theory and practice, and between intuition and rigor. We attempt to ground approaches in theoretical ideas, both mathematical and linguistic, but simultaneously we try to not let the material get too dry, and try to show how theoretical ideas have been used to solve practical problems. To do this, we first present key concepts in probability theory, statistics, information theory, and linguistics in order to give students the foundations to understand the field and contribute to it. Then we describe the problems that are addressed in Statistical Natural Language Processing (NLP), like tagging and disambiguation, and a selection of important work so that students are grounded in the advances that have been made and, having understood the special problems that language poses, can move the field forward.
When we designed the basic structure of the book, we had to make
a number of decisions about what to include and how to organize the material. A key criterion was to keep the book to a manageable size. (We didn't entirely succeed!) Thus the book is not a complete introduction to probability theory, information theory, statistics, and the many other areas of mathematics that are used in Statistical NLP. We have tried to cover those topics that seem most important in the field, but there will be many occasions when those teaching from the book will need to use supplementary materials for a more in-depth coverage of mathematical foundations that are of particular interest.
We also decided against attempting to present Statistical NLP as homogeneous in terms of the mathematical tools and theories that are used. It is true that a unified underlying mathematical theory would be desirable, but such a theory simply does not exist at this point. This has led to an eclectic mix in some places, but we believe that it is too early to mandate that a particular approach to NLP is right and should be given preference to others.
A perhaps surprising decision is that we do not cover speech recognition. Speech recognition began as a separate field to NLP, mainly growing out of electrical engineering departments, with separate conferences and journals, and many of its own concerns. However, in recent years there has been increasing convergence and overlap. It was research into speech recognition that inspired the revival of statistical methods within NLP, and many of the techniques that we present were developed first for speech and then spread over into NLP. In particular, work on language models within speech recognition greatly overlaps with the discussion of language models in this book. Moreover, one can argue that speech recognition is the area of language processing that currently is the most successful and the one that is most widely used in applications. Nevertheless, there are a number of practical reasons for excluding the area from this book: there are already several good textbooks for speech, it is not an area in which we have worked or are terribly expert, and this book seemed quite long enough without including speech as well. Additionally, while there is overlap, there is also considerable separation: a speech recognition textbook requires thorough coverage of issues in signal analysis and
acoustic modeling which would not generally be of interest or accessible
to someone from a computer science or NLP background, while in the reverse direction, most people studying speech would be uninterested in many of the NLP topics on which we focus.
Other related areas that have a somewhat fuzzy boundary with Statistical NLP are machine learning, text categorization, information retrieval, and cognitive science. For all of these areas, one can find examples of work that is not covered and which would fit very well into the book. It was simply a matter of space that we did not include important concepts, methods and problems like minimum description length, backpropagation, the Rocchio algorithm, and the psychological and cognitive-science literature on frequency effects on language processing.
The decisions that were most difficult for us to make are those that concern the boundary between statistical and non-statistical NLP. We believe that, when we started the book, there was a clear dividing line between the two, but this line has become much more fuzzy recently. An increasing number of non-statistical researchers use corpus evidence and incorporate quantitative methods. And it is now generally accepted in Statistical NLP that one needs to start with all the scientific knowledge that is available about a phenomenon when building a probabilistic or other model, rather than closing one's eyes and taking a clean-slate approach.
Many NLP researchers will therefore question the wisdom of writing a separate textbook for the statistical side. And the last thing we would want to do with this textbook is to promote the unfortunate view in some quarters that linguistic theory and symbolic computational work are not relevant to Statistical NLP. However, we believe that there is so much quite complex foundational material to cover that one simply cannot write a textbook of a manageable size that is a satisfactory and comprehensive introduction to all of NLP. Again, other good texts already exist, and we recommend using supplementary material if a more balanced coverage of statistical and non-statistical methods is desired.
A final remark is in order on the title we have chosen for this book. Calling the field Statistical Natural Language Processing might seem questionable to someone who takes their definition of a statistical method from a standard introduction to statistics. Statistical NLP as we define it comprises all quantitative approaches to automated language processing, including probabilistic modeling, information theory, and linear algebra. While probability theory is the foundation for formal statistical reasoning, we take the basic meaning of the term 'statistics' as being broader, encompassing all quantitative approaches to data (a definition which one can quickly confirm in almost any dictionary). Although there is thus some potential for ambiguity, Statistical NLP has been the most widely used term to refer to non-symbolic and non-logical work on NLP over the past decade, and we have decided to keep with this term.
Acknowledgments. Over the course of the three years that we were working on this book, a number of colleagues and friends have made comments and suggestions on earlier drafts. We would like to express our gratitude to all of them, in particular, Einat Amitay, Chris Brew, Thorsten Brants, Andreas Eisele, Michael Ernst, Oren Etzioni, Marc Friedman, Eric Gaussier, Eli Hagen, Marti Hearst, Nitin Indurkhya, Michael Inman, Mark Johnson, Rosie Jones, Tom Kalt, Andy Kehler, Julian Kupiec, Michael Littman, Arman Maghbouleh, Amir Najmi, Kris Popat, Fred Popowich, Geoffrey Sampson, Hadar Shemtov, Scott Stoness, David Yarowsky, and Jakub Zavrel. We are particularly indebted to Bob Carpenter, Eugene Charniak, Raymond Mooney, and an anonymous reviewer for MIT Press, who suggested a large number of improvements, both in content and exposition, that we feel have greatly increased the overall quality and usability of the book. We hope that they will sense our gratitude when they notice ideas which we have taken from their comments without proper acknowledgement.
We would like to also thank: Francine Chen, Kris Halvorsen, and Xerox PARC for supporting the second author while writing this book, Jane Manning for her love and support of the first author, Robert Dale and Dikran Karagueuzian for advice on book design, and Amy Brand for her regular help and assistance as our editor.
Feedback. While we have tried hard to make the contents of this book understandable, comprehensive, and correct, there are doubtless many places where we could have done better. We welcome feedback to the authors via email to cmanning@acm.org or hinrich@hotmail.com.
In closing, we can only hope that the availability of a book which collects many of the methods used within Statistical NLP and presents them in an accessible fashion will create excitement in potential students, and help ensure continued rapid progress in the field.

Christopher Manning
Hinrich Schütze
February 1999
Road Map
IN GENERAL, this book is written to be suitable for a graduate-level semester-long course focusing on Statistical NLP. There is actually rather more material than one could hope to cover in a semester, but that richness gives ample room for the teacher to pick and choose. It is assumed that the student has prior programming experience, and has some familiarity with formal languages and symbolic parsing methods. It is also assumed that the student has a basic grounding in such mathematical concepts as set theory, logarithms, vectors and matrices, summations, and integration - we hope nothing more than an adequate high school education! The student may have already taken a course on symbolic NLP methods, but a lot of background is not assumed. In the directions of probability and statistics, and linguistics, we try to briefly summarize all the necessary background, since in our experience many people wanting to learn about Statistical NLP methods have no prior knowledge in these areas (perhaps this will change over time!). Nevertheless, study of supplementary material in these areas is probably necessary for a student to have an adequate foundation from which to build, and can only be of value to the prospective researcher.
What is the best way to read this book and teach from it? The book is organized into four parts: Preliminaries (part I), Words (part II), Grammar (part III), and Applications and Techniques (part IV).
Part I lays out the mathematical and linguistic foundation that the other parts build on. Concepts and techniques introduced here are referred to throughout the book.
Part II covers word-centered work in Statistical NLP. There is a natural progression from simple to complex linguistic phenomena in its four chapters on collocations, n-gram models, word sense disambiguation, and lexical acquisition, but each chapter can also be read on its own.
The four chapters in part III, Markov Models, tagging, probabilistic context free grammars, and probabilistic parsing, build on each other, and so they are best presented in sequence. However, the tagging chapter can be read separately with occasional references to the Markov Model chapter.
The topics of part IV are four applications and techniques: statistical alignment and machine translation, clustering, information retrieval, and text categorization. Again, these chapters can be treated separately according to interests and time available, with the few dependencies between them marked appropriately.
Although we have organized the book with a lot of background andfoundational material in part I, we would not advise going through all of
it carefully at the beginning of a course based on this book. What the authors have generally done is to review the really essential bits of part I in about the first 6 hours of a course. This comprises very basic probability (through section 2.1.8), information theory (through section 2.2.7), and essential practical knowledge - some of which is contained in chapter 4, and some of which is the particulars of what is available at one's own institution. We have generally left the contents of chapter 3 as a reading assignment for those without much background in linguistics. Some knowledge of linguistic concepts is needed in many chapters, but is particularly relevant to chapter 12, and the instructor may wish to review some syntactic concepts at this point. Other material from the early chapters is then introduced on a "need to know" basis during the course.
The choice of topics in part II was partly driven by a desire to be able to present accessible and interesting topics early in a course, in particular, ones which are also a good basis for student programming projects. We have found collocations (chapter 5), word sense disambiguation (chapter 7), and attachment ambiguities (section 8.3) particularly successful in this regard. Early introduction of attachment ambiguities is also effective in showing that there is a role for linguistic concepts and structures in Statistical NLP. Much of the material in chapter 6 is rather detailed reference material. People interested in applications like speech or optical character recognition may wish to cover all of it, but if n-gram language models are not a particular focus of interest, one may only want to read through section 6.2.3. This is enough to understand the concept of likelihood, maximum likelihood estimates, a couple of simple smoothing methods (usually necessary if students are to be building any probabilistic models on their own), and good methods for assessing the performance of systems.
In general, we have attempted to provide ample cross-references so that, if desired, an instructor can present most chapters independently with incorporation of prior material where appropriate. In particular, this is the case for the chapters on collocations, lexical acquisition, tagging, and information retrieval.
Exercises. There are exercises scattered through or at the end of every chapter. They vary enormously in difficulty and scope. We have tried to provide an elementary classification as follows:
* Simple problems that range from text comprehension through to such things as mathematical manipulations, simple proofs, and thinking of examples of something.
** More substantial problems, many of which involve either programming or corpus investigations. Many would be suitable as an assignment to be done over two weeks.
*** Large, difficult, or open-ended problems. Many would be suitable as a term project.
Website. Finally, we encourage students and teachers to take advantage of the material and the references on the companion website. It can be accessed directly at the URL http://www.sultry.arts.usyd.edu.au/fsnlp, or found through the MIT Press website http://mitpress.mit.edu, by searching for this book.
Part I
Preliminaries
Trang 33“Statistical considerations are essential to an understanding of the operation and development of languages”
(Lyons 1968: 98)
“One’s ability to produce and recognize grammatical utterances
is not based on notions of statistical approximation and the like" (Chomsky 1957: 16)
“You say: the point isn’t the word, but its meaning, and you think of the meaning as a thing of the same kind as the word, though also different from the word Here the word, there the meaning The money, and the cow that you can buy with it (But contrast: money, and its use.)”
(Wittgenstein 1968, Philosophical Investigations, §120)
“For a large class of cases-though not for all-in which we employ the word ‘meaning’ it can be defined thus: the meaning
of a word is its use in the language." (Wittgenstein 1968, §43)
“Now isn‘t it queer that I say that the word ‘is’ is used with two
different meanings (as the copula and as the sign of equality),
and should not care to say that its meaning is its use; its use, that is, as the copula and the sign of equality?”
(Wittgenstein 1968, §561)
1 Introduction
THE AIM of a linguistic science is to be able to characterize and explain the multitude of linguistic observations circling around us, in conversations, writing, and other media. Part of that has to do with the cognitive side of how humans acquire, produce, and understand language, part of it has to do with understanding the relationship between linguistic utterances and the world, and part of it has to do with understanding the linguistic structures by which language communicates. In order to [...] which are used to structure linguistic expressions. This basic approach has a long history that extends back at least 2000 years, but in this century the approach became increasingly formal and rigorous as linguists explored detailed grammars that attempted to describe what were well-formed versus ill-formed utterances of a language.
However, it has become apparent that there is a problem with this conception. Indeed it was noticed early on by Edward Sapir, who summed it up in his famous quote "All grammars leak" (Sapir 1921: 38). It is just not possible to provide an exact and complete characterization of well-formed utterances that cleanly divides them from all other sequences of words, which are regarded as ill-formed utterances. This is because people are always stretching and bending the 'rules' to meet their communicative needs. Nevertheless, it is certainly not the case that the rules are completely ill-founded. Syntactic rules for a language, such as that a basic English noun phrase consists of an optional determiner, some number of adjectives, and then a noun, do capture major patterns within the language. But somehow we need to make things looser, in accounting for the creativity of language use.
This book explores an approach that addresses this problem head on. Rather than starting off by dividing sentences into grammatical and ungrammatical ones, we instead ask, "What are the common patterns that occur in language use?" The major tool which we use to identify these patterns is counting things, otherwise known as statistics, and so the scientific foundation of the book is found in probability theory. Moreover, we are not merely going to approach this issue as a scientific question, but rather we wish to show how statistical models of language are built and successfully used for many natural language processing (NLP) tasks. While practical utility is something different from the validity of a theory, the usefulness of statistical models of language tends to confirm that there is something right about the basic approach.
Adopting a Statistical NLP approach requires mastering a fair number of theoretical tools, but before we delve into a lot of theory, this chapter spends a bit of time attempting to situate the approach to natural language processing that we pursue in this book within a broader context. One should first have some idea about why many people are adopting a statistical approach to natural language processing and of how one should go about this enterprise. So, in this first chapter, we examine some of the philosophical themes and leading ideas that motivate a statistical approach to linguistics and NLP, and then proceed to get our hands dirty by beginning an exploration of what one can learn by looking at statistics over texts.
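To make the idea of "counting things" concrete, here is a minimal illustrative sketch in Python (the code, the sample sentence, and the crude whitespace tokenization are our own assumptions for illustration, not material from the book's text) of the kind of word and bigram tallying that the rest of the book builds probability models from:

    # Counting word and bigram frequencies in a tiny text sample.
    # The sample string and the naive tokenization are toy assumptions;
    # chapter 4 discusses what actually counts as a word in real corpora.
    from collections import Counter

    text = "the cat sat on the mat and the dog sat on the rug"
    tokens = text.lower().split()

    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    print(unigram_counts.most_common(3))
    # [('the', 4), ('sat', 2), ('on', 2)]
    print(bigram_counts.most_common(2))
    # [(('sat', 'on'), 2), (('on', 'the'), 2)]

    # Relative frequencies: the simplest probability estimates from counts
    total = sum(unigram_counts.values())
    print({w: round(c / total, 3) for w, c in unigram_counts.most_common(3)})
    # {'the': 0.308, 'sat': 0.154, 'on': 0.154}

Section 1.4 performs exactly this kind of counting, at a much larger scale, over Tom Sawyer and newswire text, and chapter 6 turns such counts into n-gram models of language.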
1.1 Rationalist and Empiricist Approaches to Language
Some language researchers and many NLP practitioners are perfectly happy to just work on text without thinking much about the relationship between the mental representation of language and its manifestation in written form. Readers sympathetic with this approach may feel like skipping to the practical sections, but even practically-minded people have to confront the issue of what prior knowledge to try to build into their model, even if this prior knowledge might be clearly different from what might be plausibly hypothesized for the brain. This section briefly discusses the philosophical issues that underlie this question.
Between about 1960 and 1985, most of linguistics, psychology, artificial intelligence, and natural language processing was completely dominated by a rationalist approach. A rationalist approach is characterized