Foundations of Statistical Natural Language Processing
Second printing, 1999
© 1999 Massachusetts Institute of Technology
Second printing with corrections, 2000
All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher. Typeset in 10/13 Lucida Bright by the authors using LaTeX2e.
Printed and bound in the United States of America.
Library of Congress Cataloging-in-Publication Information
6 Statistical Inference: n-gram Models over Sparse Data 191
7 Word Sense Disambiguation 229
IV Applications and Techniques 461
13 Statistical Alignment and Machine Translation 463
14 Clustering 495
15 Topics in Information Retrieval 529
16 Text Categorization 575
1.2.3 Language and cognition as probabilistic phenomena 15
1.3 The Ambiguity of Language: Why NLP Is Difficult 17
1.4 Dirty Hands 19
1.6 Exercises 35
2 Mathematical Foundations 39
2.1 Elementary Probability Theory 40
2.1.1 Probability spaces 40
2.1.2 Conditional probability and independence
2.1.3 Bayes' theorem 43
2.1.4 Random variables 45
2.1.5 Expectation and variance 46
2.1.6 Notation 47
2.1.7 Joint and conditional distributions 48
2.1.8 Determining P 48
2.1.9 Standard distributions 50
2.1.10 Bayesian statistics 54
2.1.11 Exercises 59
2.2.8 Perplexity 78
2.2.9 Exercises 78
2.3 Further Reading 79
3 Linguistic Essentials 81
3.1 Parts of Speech and Morphology 81
3.1.1 Nouns and pronouns 83
3.1.2 Words that accompany nouns: Determiners and adjectives 87
3.1.3 Verbs 88
3.1.4 Other parts of speech 91
3.2 Phrase Structure 93
3.2.1 Phrase structure grammars 96
3.2.2 Dependency: Arguments and adjuncts 101
3.2.3 X' theory 106
3.2.4 Phrase structure ambiguity 107
4.2.1 Low-level formatting issues 123
4.2.2 Tokenization: What is a word? 124
5.3.3 Pearson’s chi-square test 169
6.1.3 Building n-gram models 195
6.2 Statistical Estimators 196
6.2.1 Maximum Likelihood Estimation (MLE) 197
6.2.2 Laplace's law, Lidstone's law and the Jeffreys-Perks law 202
6.2.3 Held out estimation 205
6.2.4 Cross-validation (deleted estimation) 210
7.1.1 Supervised and unsupervised learning 232
7.1.2 Pseudowords 233
7.1.3 Upper and lower bounds on performance 233
7.3.4 One sense per discourse, one sense per collocation 249
7.4 Unsupervised Disambiguation 252
7.5 What Is a Word Sense? 256
7.6 Further Reading 260
7.7 Exercises 262
8.3.1 Hindle and Rooth (1993) 280
8.3.2 General remarks on PP attachment 284
9.3 The Three Fundamental Questions for HMMs 325
9.3.1 Finding the probability of an observation 326
9.3.2 Finding the best state sequence 331
9.3.3 The third problem: Parameter estimation 333
9.4 HMMs: Implementation, Properties, and Variants 336
9.4.1 Implementation 336
9.4.2 Variants 337
9.4.3 Multiple input observations 338
9.4.4 Initialization of parameter values 339
9.5 Further Reading 339
10 Part-of-Speech Tagging 341
10.1 The Information Sources in Tagging 343
10.2 Markov Model Taggers 345
10.2.1 The probabilistic model 345
10.2.2 The Viterbi algorithm 349
10.2.3 Variations 351
10.3 Hidden Markov Model Taggers 356
10.3.1 Applying HMMs to POS tagging 357
10.3.2 The effect of initialization on HMM training
10.4 Transformation-Based Learning of Tags 361
10.4.1 Transformations 362
10.4.2 The learning algorithm 364
10.4.3 Relation to other models 365
10.4.4 Automata 367
10.4.5 Summary 369
10.5 Other Methods, Other Languages 370
10.5.1 Other approaches to tagging 370
10.5.2 Languages other than English 371
10.6 Tagging Accuracy and Uses of Taggers 371
10.6.1 Tagging accuracy 371
10.6.2 Applications of tagging 374
10.7 Further Reading 377
11.4 Problems with the Inside-Outside Algorithm 401
11.5 Further Reading 402
12.1.4 Weakening the independence assumptions of PCFGs 416
12.1.5 Tree probabilities and derivational probabilities 421
12.1.6 There's more than one way to do it 423
12.1.7 Phrase structure grammars and dependency grammars 428
12.1.8 Evaluation 431
12.1.9 Equivalent models 437
12.1.10 Building parsers: Search methods 439
12.1.11 Use of the geometric mean 442
12.2 Some Approaches 443
12.2.1 Non-lexicalized treebank grammars 443
12.2.2 Lexicalized models using derivational histories 448
12.2.3 Dependency-based models 451
12.2.4 Discussion 454
12.3 Further Reading 456
12.4 Exercises 458
IV Applications and Techniques 461
13 Statistical Alignment and Machine Translation 463
14.1.1 Single-link and complete-link clustering 503
14.1.2 Group-average agglomerative clustering 507
14.1.3 An application: Improving a language model 509
14.1.4 Top-down clustering 512
14.2 Non-Hierarchical Clustering 514
14.2.1 K-means 515
14.2.2 The EM algorithm 518
14.3 Further Reading 527
15.3.4 Inverse document frequency 551
15.3.5 Residual inverse document frequency 553
15.3.6 Usage of term distribution models 554
15.4 Latent Semantic Indexing 554
15.4.1 Least-squares methods 557
15.4.2 Singular Value Decomposition 558
15.4.3 Latent Semantic Indexing in IR 564
15.5 Discourse Segmentation 566
15.5.1 TextTiling 567
15.6 Further Reading 570
15.7 Exercises 573
16 Text Categorization 575
16.1 Decision Trees 578
16.2 Maximum Entropy Modeling 589
16.2.1 Generalized iterative scaling 591
16.2.2 Application to text categorization 594
16.3 Perceptrons 597
16.4 k Nearest Neighbor Classification 604
16.5 Further Reading 607
Tiny Statistical Tables 609
Bibliography 611
Index 657
List of Tables
1.2 Frequency of frequencies of word types in Tom Sawyer. 22
1.3 Empirical evaluation of Zipf's law on Tom Sawyer. 24
1.4 Commonest bigram collocations in the New York Times. 30
2.1 Likelihood ratios between two theories 58
2.2 Statistical NLP problems as decoding problems 71
4.1 Major suppliers of electronic corpora with contact URLs 119
4.2 Different formats for telephone numbers appearing in an issue of The Economist. 131
4.3 Sentence lengths in newswire text 137
4.4 Sizes of various tag sets 140
4.5 Comparison of different tag sets: adjective, adverb, conjunction, determiner, noun, and pronoun tags 141
4.6 Comparison of different tag sets: Verb, preposition, punctuation and symbol tags 142
5.2 Part of speech tag patterns for collocation filtering 154
5.3 Finding Collocations: Justeson and Katz' part-of-speech filter 155
5.4 The nouns w occurring most often in the patterns 'strong w' and 'powerful w.'
5.5 Finding collocations based on mean and variance
5.6 Finding collocations: The t test applied to 10 bigrams that occur with frequency 20
5.7 Words that occur significantly more often with powerful (the first ten words) and strong (the last ten words).
5.8 A 2-by-2 table showing the dependence of occurrences of new and companies.
5.9 Correspondence of vache and cow in an aligned corpus
5.10 Testing for the independence of words in different corpora using χ²
5.11 How to compute Dunning's likelihood ratio test
5.12 Bigrams of powerful with the highest scores according to Dunning's likelihood ratio test
5.13 Damerau's frequency ratio test
5.14 Finding collocations: Ten bigrams that occur with frequency 20, ranked according to mutual information
5.15 Correspondence of chambre and house and communes and house in the aligned Hansard corpus.
5.16 Problems for Mutual Information from data sparseness
5.17 Different definitions of mutual information in (Cover and Thomas 1991) and (Fano 1961)
5.18 Collocations in the BBI Combinatory Dictionary of English for the words strength and power.
6.1 Growth in number of parameters for n-gram models
6.2 Notation for the statistical estimation chapter
6.3 Probabilities of each successive word for a clause from Persuasion
Notational conventions used in this chapter 235
Clues for two senses of drug used by a Bayesian classifier. 238
Highly informative indicators for three ambiguous French words
Disambiguation of ash with Lesk's algorithm 243
Some results of thesaurus-based disambiguation 247
How to disambiguate interest using a second-language corpus 248
Examples of the one sense per discourse constraint 250
Some results of unsupervised disambiguation 256
The F measure and accuracy are different objective functions 270
Some subcategorization frames with example verbs and
Some subcategorization frames learned by Manning's system 276
An example where the simple model for resolving PP
Selectional Preference Strength (SPS) 290
Association strength distinguishes a verb's plausible and
Similarity measures for binary vectors 299
The cosine as a measure of semantic similarity 302
Measures of (dis-)similarity between probability distributions 304
Types of words occurring in the LOB corpus that were not
Variable calculations for O = (lem, ice_t, cola) 330
Some part-of-speech tags frequently used for tagging English 342
Idealized counts of some tag transitions in the Brown Corpus 348
Idealized counts of tags that some words occur within the Brown Corpus 349
Table of probabilities for dealing with unknown words in tagging 352
Initialization of the parameters of an HMM 359
Triggering environments in Brill's transformation-based tagger 363
Examples of some transformations learned in transformation-based tagging 363
Examples of frequent errors of probabilistic taggers 374
10.10 A portion of a confusion matrix for part of speech tagging 375
11.1 Notation for the PCFG chapter 383
11.2 A simple Probabilistic Context Free Grammar (PCFG) 384
11.3 Calculation of inside probabilities 394
12.1 Abbreviations for phrasal categories in the Penn Treebank 413
12.2 Frequency of common subcategorization frames (local trees expanding VP) for selected verbs 418
12.3 Selected common expansions of NP as Subject vs Object, ordered by log odds ratio 420
12.4 Selected common expansions of NP as first and second object inside VP 420
12.5 Precision and recall evaluation results for PP attachment errors for different styles of phrase structure 436
12.6 Comparison of some statistical parsing systems 455
13.1 Sentence alignment papers 470
14.1 A summary of the attributes of different clustering algorithms 500
14.2 Symbols used in the clustering chapter 501
14.3 Similarity functions used in clustering 503
14.4 An example of K-means clustering 518
14.5 An example of a Gaussian mixture 521
15.1 An example of the evaluation of rankings 535
Components of tf.idf weighting schemes.
Document frequency (df) and collection frequency (cf) for 6 words in the New York Times corpus
Actual and estimated number of documents with k occurrences for six terms
Example for exploiting co-occurrence in computing content similarity
The matrix of document correlations BᵀB
16.1 Some examples of classification tasks in NLP. 576
16.2 Contingency table for evaluating a binary classifier 577
16.3 The representation of document 11, shown in figure 16.3 581
16.4 An example of information gain as a splitting criterion 582
16.5 Contingency table for a decision tree for the Reuters category "earnings." 586
16.6 An example of a maximum entropy distribution in the form of equation (16.4) 593
16.7 An empirical distribution whose corresponding maximum entropy distribution is the one in table 16.6 594
16.8 Feature weights in maximum entropy modeling for the category "earnings" in Reuters 595
16.9 Classification results for the distribution corresponding to table 16.8 on the test set 595
16.10 Perceptron for the "earnings" category 601
16.11 Classification results for the perceptron in table 16.10 on the test set 602
16.12 Classification results for an 1NN categorizer for the "earnings" category 606
List of Figures

1.1 Zipf's law 26
1.2 Mandelbrot's formula 27
1.3 Key Word In Context (KWIC) display for the word showed. 32
1.4 Syntactic frames for showed in Tom Sawyer. 33
2.1 A diagram illustrating the calculation of conditional probability P(A|B). 42
2.2 A random variable X for the sum of two dice 45
2.3 Two examples of binomial distributions: b(r; 10, 0.7) and b(r; 10, 0.1) 52
2.4 Example normal distribution curves: n(x; 0, 1) and n(x; 1.5, 2) 53
2.5 The entropy of a weighted coin 63
2.6 The relationship between mutual information I and entropy H 67
2.7 The noisy channel model. 69
2.8 A binary symmetric channel 69
2.9 The noisy channel model in linguistics 70
3.1 An example of recursive phrase structure expansion 99
3.2 An example of a prepositional phrase attachment ambiguity 108
4.1 Heuristic sentence boundary detection algorithm 135
4.2 A sentence as tagged according to several different tag sets. 140
5.1 Using a three word collocational window to capture bigrams at a distance 158
Adaptive thesaurus-based disambiguation.
Disambiguation based on a second-language corpus
Disambiguation based on "one sense per collocation" and "one sense per discourse."
7.8 An EM algorithm for learning a word sense clustering
8.1 A diagram motivating the measures of precision and recall 268
The crazy soft drink machine, showing the states of the machine and the state transition probabilities
A section of an HMM for a linearly interpolated language model
Trellis algorithms: Closeup of the computation of forward probabilities at one node
Algorithm for training a Visible Markov Model Tagger 348
Algorithm for tagging with a Visible Markov Model Tagger 350
The learning algorithm for transformation-based tagging 364
The two parse trees, their probabilities, and the sentence probability 385
A Probabilistic Regular Grammar (PRG) 390
Inside and outside probabilities in PCFGs 391
A Penn Treebank tree. 413
Two CFG derivations of the same tree. 421
An LC stack parser 425
Decomposing a local tree into dependencies 430
An example of the PARSEVAL measures 433
The idea of crossing brackets 434
Penn trees versus other trees 436
Different strategies for Machine Translation 464
Alignment and correspondence 469
Calculating the cost of alignments 473
A sample dot plot 476
The pillow-shaped envelope that is searched 480
The noisy channel model in machine translation 486
14.1 A single-link clustering of 22 frequent English words represented as a dendrogram 496
14.2 Bottom-up hierarchical clustering 502
14.3 Top-down hierarchical clustering 502
14.4 A cloud of points in a plane 504
14.5 Intermediate clustering of the points in figure 14.4 504
14.6 Single-link clustering of the points in figure 14.4 505
14.7 Complete-link clustering of the points in figure 14.4 505
14.8 The K-means clustering algorithm 516
14.9 One iteration of the K-means algorithm 517
14.10 An example of using the EM algorithm for soft clustering 519
15.1 Results of the search '"glass pyramid" Pei Louvre' on an internet search engine 531
15.2 Two examples of precision-recall curves 537
15.3 A vector space with two dimensions 540
15.4 The Poisson distribution 546
15.5 An example of a term-by-document matrix A. 555
15.6 Dimensionality reduction 555
15.7 An example of linear regression 558
15.8 The matrix T of the SVD decomposition of the matrix in figure 15.5 560
15.9 The matrix of singular values of the SVD decomposition of the matrix in figure 15.5 560
15.10 The matrix D of the SVD decomposition of the matrix in figure 15.5 561
15.11 The matrix B = S2×2D2×n of documents after rescaling with singular values and reduction to two dimensions 562
15.12 Three constellations of cohesion scores in topic boundary identification 569
16.1 A decision tree 578
16.2 Geometric interpretation of part of the tree in figure 16.1 579
16.3 An example of a Reuters news story in the topic category "earnings." 580
16.4 Pruning a decision tree 585
16.5 Classification accuracy depends on the amount of training data available 587
16.6 An example of how decision trees use data inefficiently from the domain of phonological rule learning 588
16.7 The Perceptron Learning Algorithm 598
16.8 One error-correcting step of the perceptron learning algorithm 600
16.9 Geometric interpretation of a perceptron 602
Table of Notations

The complement of set A
The empty set
The power set of A
Cardinality of a set
Sum
Product
p implies q (logical inference)
p and q are logically equivalent
Defined to be equal to (only used if "=" is ambiguous)
The set of real numbers
N The set of natural numbers
n! The factorial of n
|x| Absolute value of a number
<< Much smaller than
>> Much greater than
f: A → B A function f from values in A to B
max f The maximum value of f
min f The minimum value of f
arg max f The argument for which f has its maximum value
arg min f The argument for which f has its minimum value
lim x→∞ f(x) The limit of f as x tends to infinity
f is proportional to g
Partial derivative
Integral
The logarithm of a
The exponential function
The smallest integer i s.t. i ≥ a
A real-valued vector: x ∈ ℝⁿ
Euclidean length of x
The dot product of x and y
The cosine of the angle between x and y
Element in row i and column j of matrix C
Cᵀ Transpose of matrix C
X̂ Estimate of X
E(X) Expectation of X
Var(X) Variance of X
µ Mean
σ Standard deviation
x̄ Sample mean
s² Sample variance
P(A|B) The probability of A conditional on B
X ~ p(X) Random variable X is distributed according to p
b(r; n, p) The binomial distribution
Combination or binomial coefficient (the number of ways of choosing r objects from n)
n(x; µ, σ) The normal distribution
H(X) Entropy
Count of the entity in parentheses
The relative frequency of u
The words wi, wi+1, ..., wj
The same as wij
The same as wij
Time complexity of an algorithm
Ungrammatical sentence or phrase or ill-formed word
Marginally grammatical sentence or marginally acceptable phrase
Note: Some chapters have separate notation tables for symbols that are used locally: table 6.2 (Statistical Inference), table 7.1 (Word Sense Disambiguation), table 9.1 (Markov Models), table 10.2 (Tagging), table 11.1 (Probabilistic Context-Free Grammars), and table 14.2 (Clustering).
Preface

THE NEED for a thorough textbook for Statistical Natural Language Processing hardly needs to be argued for in the age of on-line information, electronic communication and the World Wide Web. Increasingly, businesses, government agencies and individuals are confronted with large amounts of text that are critical for working and living, but not well enough understood to get the enormous value out of them that they potentially hide.
At the same time, the availability of large text corpora has changed the scientific approach to language in linguistics and cognitive science. Phenomena that were not detectable or seemed uninteresting in studying toy domains and individual sentences have moved into the center field of what is considered important to explain. Whereas as recently as the early 1990s quantitative methods were seen as so inadequate for linguistics that an important textbook for mathematical linguistics did not cover them in any way, they are now increasingly seen as crucial for linguistic theory.
In this book we have tried to achieve a balance between theory and practice, and between intuition and rigor. We attempt to ground approaches in theoretical ideas, both mathematical and linguistic, but simultaneously we try to not let the material get too dry, and try to show how theoretical ideas have been used to solve practical problems. To do this, we first present key concepts in probability theory, statistics, information theory, and linguistics in order to give students the foundations to understand the field and contribute to it. Then we describe the problems that are addressed in Statistical Natural Language Processing (NLP), like tagging and disambiguation, and a selection of important work so that students are grounded in the advances that have been made and, having understood the special problems that language poses, can move the field forward.
When we designed the basic structure of the book, we had to make
a number of decisions about what to include and how to organize the material. A key criterion was to keep the book to a manageable size. (We didn't entirely succeed!) Thus the book is not a complete introduction to probability theory, information theory, statistics, and the many other areas of mathematics that are used in Statistical NLP. We have tried to cover those topics that seem most important in the field, but there will be many occasions when those teaching from the book will need to use supplementary materials for a more in-depth coverage of mathematical foundations that are of particular interest.
We also decided against attempting to present Statistical NLP as homogeneous in terms of the mathematical tools and theories that are used. It is true that a unified underlying mathematical theory would be desirable, but such a theory simply does not exist at this point. This has led to an eclectic mix in some places, but we believe that it is too early to mandate that a particular approach to NLP is right and should be given preference to others.
A perhaps surprising decision is that we do not cover speech recognition. Speech recognition began as a separate field to NLP, mainly growing out of electrical engineering departments, with separate conferences and journals, and many of its own concerns. However, in recent years there has been increasing convergence and overlap. It was research into speech recognition that inspired the revival of statistical methods within NLP, and many of the techniques that we present were developed first for speech and then spread over into NLP. In particular, work on language models within speech recognition greatly overlaps with the discussion of language models in this book. Moreover, one can argue that speech recognition is the area of language processing that currently is the most successful and the one that is most widely used in applications. Nevertheless, there are a number of practical reasons for excluding the area from this book: there are already several good textbooks for speech, it is not an area in which we have worked or are terribly expert, and this book seemed quite long enough without including speech as well. Additionally, while there is overlap, there is also considerable separation: a speech recognition textbook requires thorough coverage of issues in signal analysis and
acoustic modeling which would not generally be of interest or accessible
to someone from a computer science or NLP background, while in the reverse direction, most people studying speech would be uninterested in many of the NLP topics on which we focus.
Other related areas that have a somewhat fuzzy boundary with Statistical NLP are machine learning, text categorization, information retrieval, and cognitive science. For all of these areas, one can find examples of work that is not covered and which would fit very well into the book. It was simply a matter of space that we did not include important concepts, methods and problems like minimum description length, backpropagation, the Rocchio algorithm, and the psychological and cognitive-science literature on frequency effects on language processing.
The decisions that were most difficult for us to make are those that concern the boundary between statistical and non-statistical NLP. We believe that, when we started the book, there was a clear dividing line between the two, but this line has become much more fuzzy recently. An increasing number of non-statistical researchers use corpus evidence and incorporate quantitative methods. And it is now generally accepted in Statistical NLP that one needs to start with all the scientific knowledge that is available about a phenomenon when building a probabilistic or other model, rather than closing one's eyes and taking a clean-slate approach.
Many NLP researchers will therefore question the wisdom of writing a separate textbook for the statistical side. And the last thing we would want to do with this textbook is to promote the unfortunate view in some quarters that linguistic theory and symbolic computational work are not relevant to Statistical NLP. However, we believe that there is so much quite complex foundational material to cover that one simply cannot write a textbook of a manageable size that is a satisfactory and comprehensive introduction to all of NLP. Again, other good texts already exist, and we recommend using supplementary material if a more balanced coverage of statistical and non-statistical methods is desired.
A final remark is in order on the title we have chosen for this book. Calling the field Statistical Natural Language Processing might seem questionable to someone who takes their definition of a statistical method from a standard introduction to statistics. Statistical NLP as we define it comprises all quantitative approaches to automated language processing, including probabilistic modeling, information theory, and linear algebra. While probability theory is the foundation for formal statistical reasoning, we take the basic meaning of the term 'statistics' as being broader, encompassing all quantitative approaches to data (a definition which one can quickly confirm in almost any dictionary). Although there is thus some potential for ambiguity, Statistical NLP has been the most widely used term to refer to non-symbolic and non-logical work on NLP over the past decade, and we have decided to keep with this term.
Acknowledgments. Over the course of the three years that we were working on this book, a number of colleagues and friends have made comments and suggestions on earlier drafts. We would like to express our gratitude to all of them, in particular, Einat Amitay, Chris Brew, Thorsten Brants, Andreas Eisele, Michael Ernst, Oren Etzioni, Marc Friedman, Eric Gaussier, Eli Hagen, Marti Hearst, Nitin Indurkhya, Michael Inman, Mark Johnson, Rosie Jones, Tom Kalt, Andy Kehler, Julian Kupiec, Michael Littman, Arman Maghbouleh, Amir Najmi, Kris Popat, Fred Popowich, Geoffrey Sampson, Hadar Shemtov, Scott Stoness, David Yarowsky, and Jakub Zavrel. We are particularly indebted to Bob Carpenter, Eugene Charniak, Raymond Mooney, and an anonymous reviewer for MIT Press, who suggested a large number of improvements, both in content and exposition, that we feel have greatly increased the overall quality and usability of the book. We hope that they will sense our gratitude when they notice ideas which we have taken from their comments without proper acknowledgement.
We would like to also thank: Francine Chen, Kris Halvorsen, and Xerox PARC for supporting the second author while writing this book, Jane Manning for her love and support of the first author, Robert Dale and Dikran Karagueuzian for advice on book design, and Amy Brand for her regular help and assistance as our editor.
Feedback. While we have tried hard to make the contents of this book understandable, comprehensive, and correct, there are doubtless many places where we could have done better. We welcome feedback to the authors via email to cmanning@acm.org or hinrich@hotmail.com.
In closing, we can only hope that the availability of a book which collects many of the methods used within Statistical NLP and presents them in an accessible fashion will create excitement in potential students, and help ensure continued rapid progress in the field.

Christopher Manning
Hinrich Schütze
February 1999
Road Map
IN GENERAL, this book is written to be suitable for a graduate-level semester-long course focusing on Statistical NLP. There is actually rather more material than one could hope to cover in a semester, but that richness gives ample room for the teacher to pick and choose. It is assumed that the student has prior programming experience, and has some familiarity with formal languages and symbolic parsing methods. It is also assumed that the student has a basic grounding in such mathematical concepts as set theory, logarithms, vectors and matrices, summations, and integration - we hope nothing more than an adequate high school education! The student may have already taken a course on symbolic NLP methods, but a lot of background is not assumed. In the directions of probability and statistics, and linguistics, we try to briefly summarize all the necessary background, since in our experience many people wanting to learn about Statistical NLP methods have no prior knowledge in these areas (perhaps this will change over time!). Nevertheless, study of supplementary material in these areas is probably necessary for a student to have an adequate foundation from which to build, and can only be of value to the prospective researcher.
What is the best way to read this book and teach from it? The book is organized into four parts: Preliminaries (part I), Words (part II), Grammar (part III), and Applications and Techniques (part IV).
Part I lays out the mathematical and linguistic foundation that the other parts build on. Concepts and techniques introduced here are referred to throughout the book.
Part II covers word-centered work in Statistical NLP. There is a natural progression from simple to complex linguistic phenomena in its four chapters on collocations, n-gram models, word sense disambiguation, and lexical acquisition, but each chapter can also be read on its own.
The four chapters in part III, Markov Models, tagging, probabilistic context free grammars, and probabilistic parsing, build on each other, and so they are best presented in sequence. However, the tagging chapter can be read separately with occasional references to the Markov Model chapter.
The topics of part IV are four applications and techniques: statistical alignment and machine translation, clustering, information retrieval, and text categorization. Again, these chapters can be treated separately according to interests and time available, with the few dependencies between them marked appropriately.
Although we have organized the book with a lot of background andfoundational material in part I, we would not advise going through all of
it carefully at the beginning of a course based on this book. What the authors have generally done is to review the really essential bits of part I in about the first 6 hours of a course. This comprises very basic probability (through section 2.1.8), information theory (through section 2.2.7), and essential practical knowledge - some of which is contained in chapter 4, and some of which is the particulars of what is available at one's own institution. We have generally left the contents of chapter 3 as a reading assignment for those without much background in linguistics. Some knowledge of linguistic concepts is needed in many chapters, but is particularly relevant to chapter 12, and the instructor may wish to review some syntactic concepts at this point. Other material from the early chapters is then introduced on a "need to know" basis during the course.
The choice of topics in part II was partly driven by a desire to be able to present accessible and interesting topics early in a course, in particular, ones which are also a good basis for student programming projects. We have found collocations (chapter 5), word sense disambiguation (chapter 7), and attachment ambiguities (section 8.3) particularly successful in this regard. Early introduction of attachment ambiguities is also effective in showing that there is a role for linguistic concepts and structures in Statistical NLP. Much of the material in chapter 6 is rather detailed reference material. People interested in applications like speech or optical character recognition may wish to cover all of it, but if n-gram language models are not a particular focus of interest, one may only want to read through section 6.2.3. This is enough to understand the concept of likelihood, maximum likelihood estimates, a couple of simple smoothing methods (usually necessary if students are to be building any probabilistic models on their own), and good methods for assessing the performance of systems.
In general, we have attempted to provide ample cross-references so that, if desired, an instructor can present most chapters independently with incorporation of prior material where appropriate. In particular, this is the case for the chapters on collocations, lexical acquisition, tagging, and information retrieval.
Exercises. There are exercises scattered through or at the end of every chapter. They vary enormously in difficulty and scope. We have tried to provide an elementary classification as follows:
* Simple problems that range from text comprehension through to such things as mathematical manipulations, simple proofs, and thinking of examples of something.
** More substantial problems, many of which involve either programming or corpus investigations. Many would be suitable as an assignment to be done over two weeks.
*** Large, difficult, or open-ended problems. Many would be suitable as a term project.
Website. Finally, we encourage students and teachers to take advantage of the material and the references on the companion website. It can be accessed directly at the URL http://www.sultry.arts.usyd.edu.au/fsnlp, or found through the MIT Press website http://mitpress.mit.edu, by searching for this book.
Part I
Preliminaries
Trang 33“Statistical considerations are essential to an understanding of the operation and development of languages”
(Lyons 1968: 98)
“One’s ability to produce and recognize grammatical utterances
is not based on notions of statistical approximation and the like" (Chomsky 1957: 16)
“You say: the point isn’t the word, but its meaning, and you think of the meaning as a thing of the same kind as the word, though also different from the word Here the word, there the meaning The money, and the cow that you can buy with it (But contrast: money, and its use.)”
(Wittgenstein 1968, Philosophical Investigations, §120)
“For a large class of cases-though not for all-in which we employ the word ‘meaning’ it can be defined thus: the meaning
of a word is its use in the language." (Wittgenstein 1968, §43)
“Now isn‘t it queer that I say that the word ‘is’ is used with two
different meanings (as the copula and as the sign of equality),
and should not care to say that its meaning is its use; its use, that is, as the copula and the sign of equality?”
(Wittgenstein 1968, §561)
1 Introduction
THE AIM of a linguistic science is to be able to characterize and explain the multitude of linguistic observations circling around us, in conversations, writing, and other media. Part of that has to do with the cognitive side of how humans acquire, produce, and understand language, part of it has to do with understanding the relationship between linguistic utterances and the world, and part of it has to do with understanding the linguistic structures by which language communicates. In order to [...] which are used to structure linguistic expressions. This basic approach has a long history that extends back at least 2000 years, but in this century the approach became increasingly formal and rigorous as linguists explored detailed grammars that attempted to describe what were well-formed versus ill-formed utterances of a language.
However, it has become apparent that there is a problem with this conception. Indeed it was noticed early on by Edward Sapir, who summed it up in his famous quote "All grammars leak" (Sapir 1921: 38). It is just not possible to provide an exact and complete characterization of well-formed utterances that cleanly divides them from all other sequences of words, which are regarded as ill-formed utterances. This is because people are always stretching and bending the 'rules' to meet their communicative needs. Nevertheless, it is certainly not the case that the rules are completely ill-founded. Syntactic rules for a language, such as that a basic English noun phrase consists of an optional determiner, some number of adjectives, and then a noun, do capture major patterns within the language. But somehow we need to make things looser, in accounting for the creativity of language use.
This book explores an approach that addresses this problem head on. Rather than starting off by dividing sentences into grammatical and ungrammatical ones, we instead ask, "What are the common patterns that occur in language use?" The major tool which we use to identify these patterns is counting things, otherwise known as statistics, and so the scientific foundation of the book is found in probability theory. Moreover, we are not merely going to approach this issue as a scientific question, but rather we wish to show how statistical models of language are built and successfully used for many natural language processing (NLP) tasks. While practical utility is something different from the validity of a theory, the usefulness of statistical models of language tends to confirm that there is something right about the basic approach.
Adopting a Statistical NLP approach requires mastering a fair number of theoretical tools, but before we delve into a lot of theory, this chapter spends a bit of time attempting to situate the approach to natural language processing that we pursue in this book within a broader context. One should first have some idea about why many people are adopting a statistical approach to natural language processing and of how one should go about this enterprise. So, in this first chapter, we examine some of the philosophical themes and leading ideas that motivate a statistical approach to linguistics and NLP, and then proceed to get our hands dirty by beginning an exploration of what one can learn by looking at statistics over texts.
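To make the idea of "counting things" concrete, here is a minimal illustrative sketch in Python (the code, the sample sentence, and the crude whitespace tokenization are our own assumptions for illustration, not material from the book's text) of the kind of word and bigram tallying that the rest of the book builds probability models from:

    # Counting word and bigram frequencies in a tiny text sample.
    # The sample string and the naive tokenization are toy assumptions;
    # chapter 4 discusses what actually counts as a word in real corpora.
    from collections import Counter

    text = "the cat sat on the mat and the dog sat on the rug"
    tokens = text.lower().split()

    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    print(unigram_counts.most_common(3))
    # [('the', 4), ('sat', 2), ('on', 2)]
    print(bigram_counts.most_common(2))
    # [(('sat', 'on'), 2), (('on', 'the'), 2)]

    # Relative frequencies: the simplest probability estimates from counts
    total = sum(unigram_counts.values())
    print({w: round(c / total, 3) for w, c in unigram_counts.most_common(3)})
    # {'the': 0.308, 'sat': 0.154, 'on': 0.154}

Section 1.4 performs exactly this kind of counting, at a much larger scale, over Tom Sawyer and newswire text, and chapter 6 turns such counts into n-gram models of language.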
1.1 Rationalist and Empiricist Approaches to Language
Some language researchers and many NLP practitioners are perfectly happy to just work on text without thinking much about the relationship between the mental representation of language and its manifestation in written form. Readers sympathetic with this approach may feel like skipping to the practical sections, but even practically-minded people have to confront the issue of what prior knowledge to try to build into their model, even if this prior knowledge might be clearly different from what might be plausibly hypothesized for the brain. This section briefly discusses the philosophical issues that underlie this question.
Between about 1960 and 1985, most of linguistics, psychology, artificial intelligence, and natural language processing was completely dominated by a rationalist approach. A rationalist approach is characterized