linguistic annotations. An extended discussion of web crawling is provided by (Croft, Metzler & Strohman, 2009).
Full details of the Toolbox data format are provided with the distribution (Buseman, Buseman & Early, 1996), and with the latest distribution freely available from http://www.sil.org/computing/toolbox/. For guidelines on the process of constructing a Toolbox lexicon, see http://www.sil.org/computing/ddp/. More examples of our efforts with the Toolbox are documented in (Bird, 1999) and (Robinson, Aumann & Bird, 2007). Dozens of other tools for linguistic data management are available, some surveyed by (Bird & Simons, 2003). See also the proceedings of the LaTeCH workshops on language technology for cultural heritage data.
There are many excellent resources for XML (e.g., http://zvon.org/) and for writing Python programs to work with XML (http://www.python.org/doc/lib/markup.html). Many editors have XML modes. XML formats for lexical information include OLIF (http://www.olif.net/) and LIFT (http://code.google.com/p/lift-standard/).
For a survey of linguistic annotation software, see the Linguistic Annotation Page at http://www.ldc.upenn.edu/annotation/. The initial proposal for standoff annotation was (Thompson & McKelvie, 1997). An abstract data model for linguistic annotations, called “annotation graphs,” was proposed in (Bird & Liberman, 2001). A general-purpose ontology for linguistic description (GOLD) is documented at http://www.linguistics-ontology.org/.
For guidance on planning and constructing a corpus, see (Meyer, 2002) and (Farghaly, 2003). More details of methods for scoring inter-annotator agreement are available in (Artstein & Poesio, 2008) and (Pevzner & Hearst, 2002).
Rotokas data was provided by Stuart Robinson, and Iu Mien data was provided by Greg Aumann.
For more information about the Open Language Archives Community, visit http://www.language-archives.org/, or see (Simons & Bird, 2003).
11.9 Exercises
1 ◑ In Example 11-2 the new field appeared at the bottom of the entry. Modify this program so that it inserts the new subelement right after the lx field. (Hint: create the new cv field using Element('cv'), assign a text value to it, then use the insert() method of the parent element.)
2 ◑ Write a function that deletes a specified field from a lexical entry. (We could use this to sanitize our lexical data before giving it to others, e.g., by removing fields containing irrelevant or uncertain content.)
3 ◑ Write a program that scans an HTML dictionary file to find entries having an illegal part-of-speech field, and then reports the headword for each entry.
438 | Chapter 11: Managing Linguistic Data
4 ◑ Write a program to find any parts-of-speech (ps field) that occurred less than 10 times. Perhaps these are typing mistakes?
5 ◑ We saw a method for adding a cv field (Section 11.5). There is an interesting issue with keeping this up-to-date when someone modifies the content of the lx field on which it is based. Write a version of this program to add a cv field, replacing any existing cv field.
6 ◑ Write a function to add a new field syl which gives a count of the number of syllables in the word.
7 ◑ Write a function which displays the complete entry for a lexeme. When the lexeme is incorrectly spelled, it should display the entry for the most similarly spelled lexeme.
8 ◑ Write a function that takes a lexicon and finds which pairs of consecutive fields are most frequent (e.g., ps is often followed by pt). (This might help us to discover some of the structure of a lexical entry.)
9 ◑ Create a spreadsheet using office software, containing one lexical entry per row, consisting of a headword, a part of speech, and a gloss. Save the spreadsheet in CSV format. Write Python code to read the CSV file and print it in Toolbox format, using lx for the headword, ps for the part of speech, and gl for the gloss.
10 ◑ Index the words of Shakespeare’s plays, with the help of nltk.Index. The resulting data structure should permit lookup on individual words, such as music, returning a list of references to acts, scenes, and speeches, of the form [(3, 2, 9), (5, 1, 23), ...], where (3, 2, 9) indicates Act 3 Scene 2 Speech 9.
11 ◑ Construct a conditional frequency distribution which records the word length for each speech in The Merchant of Venice, conditioned on the name of the character; e.g., cfd['PORTIA'][12] would give us the number of speeches by Portia consisting of 12 words.
12 ◑ Write a recursive function to convert an arbitrary NLTK tree into an XML counterpart, with non-terminals represented as XML elements, and leaves represented as text content, e.g.:
14 ● Build an index of those lexemes which appear in example sentences. Suppose the lexeme for a given entry is w. Then, add a single cross-reference field xrf to this entry, referencing the headwords of other entries having example sentences containing w. Do this for all entries and save the result as a Toolbox-format file.
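The hint in exercise 1 relies on the insert() method of an element. As a minimal sketch of how that call behaves, the following uses Python's standard xml.etree.ElementTree module on a toy entry. The record element, the field values, and the helper name insert_after are assumptions made for illustration; they are not taken from Example 11-2 or from the Toolbox data.

```python
from xml.etree.ElementTree import Element, SubElement

# Toy lexical entry with invented values (not read from a Toolbox file).
entry = Element('record')
for tag, text in [('lx', 'kaa'), ('ps', 'V'), ('ge', 'gag')]:
    SubElement(entry, tag).text = text

def insert_after(parent, anchor_tag, new_tag, new_text):
    """Insert a new child right after the first child named anchor_tag.

    Hypothetical helper: the name and signature are illustrative only.
    """
    for index, child in enumerate(parent):
        if child.tag == anchor_tag:
            new_field = Element(new_tag)
            new_field.text = new_text
            # insert() places the element at the given child position.
            parent.insert(index + 1, new_field)
            return new_field
    raise ValueError('no %r field in entry' % anchor_tag)

insert_after(entry, 'lx', 'cv', 'CVV')
field_order = [child.tag for child in entry]
print(field_order)  # ['lx', 'cv', 'ps', 'ge']
```

Returning as soon as the new field is inserted avoids modifying the parent while the loop is still iterating over it.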
Afterword: The Language Challenge
Natural language throws up some interesting computational challenges. We’ve explored many of these in the preceding chapters, including tokenization, tagging, classification, information extraction, and building syntactic and semantic representations. You should now be equipped to work with large datasets, to create robust models of linguistic phenomena, and to extend them into components for practical language technologies. We hope that the Natural Language Toolkit (NLTK) has served to open up the exciting endeavor of practical natural language processing to a broader audience than before.
In spite of all that has come before, language presents us with far more than a temporary challenge for computation. Consider the following sentences which attest to the riches of language:
1. Overhead the day drives level and grey, hiding the sun by a flight of grey spears. (William Faulkner, As I Lay Dying, 1935)
2. When using the toaster please ensure that the exhaust fan is turned on. (sign in dormitory kitchen)
3. Amiodarone weakly inhibited CYP2C9, CYP2D6, and CYP3A4-mediated activities with Ki values of 45.1-271.6 μM (Medline, PMID: 10718780)
4. Iraqi Head Seeks Arms (spoof news headline)
5. The earnest prayer of a righteous man has great power and wonderful results. (James 5:16b)
6. Twas brillig, and the slithy toves did gyre and gimble in the wabe (Lewis Carroll, Jabberwocky, 1872)
7. There are two ways to do this, AFAIK :smile: (Internet discussion archive)
Other evidence for the riches of language is the vast array of disciplines whose work centers on language. Some obvious disciplines include translation, literary criticism, philosophy, anthropology, and psychology. Many less obvious disciplines investigate language use, including law, hermeneutics, forensics, telephony, pedagogy, archaeology, cryptanalysis, and speech pathology. Each applies distinct methodologies to gather observations, develop theories, and test hypotheses. All serve to deepen our understanding of language and of the intellect that is manifested in language.
In view of the complexity of language and the broad range of interest in studying it from different angles, it’s clear that we have barely scratched the surface here. Additionally, within NLP itself, there are many important methods and applications that we haven’t mentioned.
In our closing remarks we will take a broader view of NLP, including its foundations and the further directions you might want to explore. Some of the topics are not well supported by NLTK, and you might like to rectify that problem by contributing new software and data to the toolkit.
Language Processing Versus Symbol Processing
The very notion that natural language could be treated in a computational manner grew out of a research program, dating back to the early 1900s, to reconstruct mathematical reasoning using logic, most clearly manifested in work by Frege, Russell, Wittgenstein, Tarski, Lambek, and Carnap. This work led to the notion of language as a formal system amenable to automatic processing. Three later developments laid the foundation for natural language processing. The first was formal language theory. This defined a language as a set of strings accepted by a class of automata, such as context-free languages and pushdown automata, and provided the underpinnings for computational syntax.
The second development was symbolic logic. This provided a formal method for capturing selected aspects of natural language that are relevant for expressing logical proofs. A formal calculus in symbolic logic provides the syntax of a language, together with rules of inference and, possibly, rules of interpretation in a set-theoretic model; examples are propositional logic and first-order logic. Given such a calculus, with a well-defined syntax and semantics, it becomes possible to associate meanings with expressions of natural language by translating them into expressions of the formal calculus. For example, if we translate John saw Mary into a formula saw(j, m), we (implicitly or explicitly) interpret the English verb saw as a binary relation, and John and Mary as denoting individuals. More general statements like All birds fly require quantifiers, in this case ∀, meaning for all: ∀x (bird(x) → fly(x)). This use of logic provided the technical machinery to perform inferences that are an important part of language understanding.
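To make the idea of interpretation in a set-theoretic model concrete, here is a minimal sketch in plain Python. It is not NLTK's logic machinery. The domain, the individuals tweety and robin, and the helper names holds and implies are all invented for illustration.

```python
# A tiny set-theoretic model; every individual and fact here is invented.
domain = {'j', 'm', 'tweety', 'robin'}
model = {
    'saw':  {('j', 'm')},          # binary relation: a set of pairs
    'bird': {'tweety', 'robin'},   # unary predicates: sets of individuals
    'fly':  {'tweety', 'robin'},
}

def holds(pred, *args):
    """True if the predicate applies to the arguments in the model."""
    value = args[0] if len(args) == 1 else tuple(args)
    return value in model[pred]

def implies(p, q):
    """Material implication: p -> q."""
    return (not p) or q

# John saw Mary  ==>  saw(j, m)
john_saw_mary = holds('saw', 'j', 'm')

# All birds fly  ==>  forall x (bird(x) -> fly(x)), checked over the domain
all_birds_fly = all(implies(holds('bird', x), holds('fly', x)) for x in domain)

print(john_saw_mary, all_birds_fly)  # True True
```

Because saw is interpreted as a set of ordered pairs, holds('saw', 'm', 'j') is False in this model: the relation is not symmetric unless the model says so.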
A closely related development was the principle of compositionality, namely that the meaning of a complex expression is composed from the meaning of its parts and their mode of combination (Chapter 10). This principle provided a useful correspondence between syntax and semantics, namely that the meaning of a complex expression could be computed recursively. Consider the sentence It is not true that p, where p is a proposition. We can represent the meaning of this sentence as not(p). Similarly, we can represent the meaning of John saw Mary as saw(j, m). Now we can compute the interpretation of It is not true that John saw Mary recursively, using the foregoing information, to get not(saw(j,m)).
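The recursive computation described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: expressions are encoded as nested tuples, the set FACTS stands in for a model, and the function names interpret and show are hypothetical. The 'and' case is an extra mode of combination added to show that the same recursion extends beyond negation.

```python
# Hypothetical model: the only fact is that John (j) saw Mary (m).
FACTS = {('saw', 'j', 'm')}

def interpret(expr):
    """Recursively compute the truth value of a nested-tuple expression."""
    operator = expr[0]
    if operator == 'not':
        return not interpret(expr[1])
    if operator == 'and':
        return interpret(expr[1]) and interpret(expr[2])
    # Otherwise treat expr as an atomic formula such as ('saw', 'j', 'm').
    return expr in FACTS

def show(expr):
    """Render an expression in functional notation, e.g. not(saw(j, m))."""
    if expr[0] in ('not', 'and'):
        return '%s(%s)' % (expr[0], ', '.join(show(part) for part in expr[1:]))
    return '%s(%s)' % (expr[0], ', '.join(expr[1:]))

john_saw_mary = ('saw', 'j', 'm')
negated = ('not', john_saw_mary)   # It is not true that John saw Mary

print(show(negated), interpret(negated))  # not(saw(j, m)) False
```

The value of the negated sentence is obtained from the value of its part, saw(j, m), and the way the parts are combined, which is the sense of compositionality used in the text.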
The approaches just outlined share the premise that computing with natural language crucially relies on rules for manipulating symbolic representations. For a certain period in the development of NLP, particularly during the 1980s, this premise provided a common starting point for both linguists and practitioners of NLP, leading to a family of grammar formalisms known as unification-based (or feature-based) grammar (see Chapter 9), and to NLP applications implemented in the Prolog programming language. Although grammar-based NLP is still a significant area of research, it has become somewhat eclipsed in the last 15–20 years due to a variety of factors. One significant influence came from automatic speech recognition. Although early work in speech processing adopted a model that emulated the kind of rule-based phonological processing typified by the Sound Pattern of English (Chomsky & Halle, 1968), this turned out to be hopelessly inadequate in dealing with the hard problem of recognizing actual speech in anything like real time. By contrast, systems which involved learning patterns from large bodies of speech data were significantly more accurate, efficient, and robust. In addition, the speech community found that progress in building better systems was hugely assisted by the construction of shared resources for quantitatively measuring performance against common test data. Eventually, much of the NLP community embraced a data-intensive orientation to language processing, coupled with a growing use of machine-learning techniques and evaluation-led methodology.
Contemporary Philosophical Divides
The contrasting approaches to NLP described in the preceding section relate back to early metaphysical debates about rationalism versus empiricism and realism versus idealism that occurred in the Enlightenment period of Western philosophy. These debates took place against a backdrop of orthodox thinking in which the source of all knowledge was believed to be divine revelation. During this period of the 17th and 18th centuries, philosophers argued that human reason or sensory experience has priority over revelation. Descartes and Leibniz, among others, took the rationalist position, asserting that all truth has its origins in human thought, and in the existence of “innate ideas” implanted in our minds from birth. For example, they argued that the principles of Euclidean geometry were developed using human reason, and were not the result of supernatural revelation or sensory experience. In contrast, Locke and others took the empiricist view, that our primary source of knowledge is the experience of our faculties, and that human reason plays a secondary role in reflecting on that experience. Often-cited evidence for this position was Galileo’s discovery, based on careful observation of the motion of the planets, that the solar system is heliocentric and not geocentric. In the context of linguistics, this debate leads to the following question: to what extent does human linguistic experience, versus our innate “language faculty,” provide the basis for our knowledge of language? In NLP this issue surfaces in debates about the priority of corpus data versus linguistic introspection in the construction of computational models.
A further concern, enshrined in the debate between realism and idealism, was the metaphysical status of the constructs of a theory. Kant argued for a distinction between phenomena, the manifestations we can experience, and “things in themselves” which can never be known directly. A linguistic realist would take a theoretical construct like noun phrase to be a real-world entity that exists independently of human perception and reason, and which actually causes the observed linguistic phenomena. A linguistic idealist, on the other hand, would argue that noun phrases, along with more abstract constructs, like semantic representations, are intrinsically unobservable, and simply play the role of useful fictions. The way linguists write about theories often betrays a realist position, whereas NLP practitioners occupy neutral territory or else lean toward the idealist position. Thus, in NLP, it is often enough if a theoretical abstraction leads to a useful result; it does not matter whether this result sheds any light on human linguistic processing.
These issues are still alive today, and show up in the distinctions between symbolic versus statistical methods, deep versus shallow processing, binary versus gradient classifications, and scientific versus engineering goals. However, such contrasts are now highly nuanced, and the debate is no longer as polarized as it once was. In fact, most of the discussions, and even most of the advances, involve a “balancing act.” For example, one intermediate position is to assume that humans are innately endowed with analogical and memory-based learning methods (weak rationalism), and use these methods to identify meaningful patterns in their sensory language experience (empiricism).
We have seen many examples of this methodology throughout this book. Statistical methods inform symbolic models anytime corpus statistics guide the selection of productions in a context-free grammar, i.e., “grammar engineering.” Symbolic methods inform statistical models anytime a corpus that was created using rule-based methods is used as a source of features for training a statistical language model, i.e., “grammatical inference.” The circle is closed.
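One way corpus statistics can guide the selection of productions is by relative-frequency estimation: each production is weighted by how often it was observed among all expansions of the same nonterminal. The following sketch illustrates this under stated assumptions. The production counts are invented rather than read from a real treebank, and the helper name best_production is hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical productions, as if read off the trees of a small treebank.
observed = [
    ('NP', ('Det', 'N')), ('NP', ('Det', 'N')), ('NP', ('Det', 'N')),
    ('NP', ('NP', 'PP')),
    ('VP', ('V', 'NP')), ('VP', ('V', 'NP')),
    ('VP', ('VP', 'PP')), ('VP', ('V',)),
]

counts = Counter(observed)
lhs_totals = Counter(lhs for lhs, rhs in observed)

# Relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)
probabilities = defaultdict(dict)
for (lhs, rhs), n in counts.items():
    probabilities[lhs][rhs] = n / lhs_totals[lhs]

def best_production(lhs):
    """Return the most probable right-hand side for a nonterminal."""
    return max(probabilities[lhs], key=probabilities[lhs].get)

for lhs in sorted(probabilities):
    for rhs, p in sorted(probabilities[lhs].items()):
        print('%s -> %s  [%.2f]' % (lhs, ' '.join(rhs), p))
```

With these counts, the NP productions receive probabilities 0.75 and 0.25, so a grammar writer or a parser could prefer NP -> Det N when both expansions are available.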
NLTK Roadmap
The Natural Language Toolkit is a work in progress, and is being continually expanded as people contribute code. Some areas of NLP and linguistics are not (yet) well supported in NLTK, and contributions in these areas are especially welcome. Check http://www.nltk.org/ for news about developments after the publication date of this book. Contributions in the following areas are particularly encouraged:
Phonology and morphology
Computational approaches to the study of sound patterns and word structures typically use a finite-state toolkit. Phenomena such as suppletion and non-concatenative morphology are difficult to address using the string-processing methods we have been studying. The technical challenge is not only to link NLTK to a high-performance finite-state toolkit, but to avoid duplication of lexical data and to link the morphosyntactic features needed by morph analyzers and syntactic parsers.
High-performance components
Some NLP tasks are too computationally intensive for pure Python implementations to be feasible. However, in some cases the expense arises only when training models, not when using them to label inputs. NLTK’s package system provides a convenient way to distribute trained models, even models trained using corpora that cannot be freely distributed. Alternatives are to develop Python interfaces to high-performance machine learning tools, or to expand the reach of Python by using parallel programming techniques such as MapReduce.
Natural language generation
Producing coherent text from underlying representations of meaning is an important part of NLP; a unification-based approach to NLG has been developed in NLTK, and there is scope for more contributions in this area.
Linguistic fieldwork
A major challenge faced by linguists is to document thousands of endangered languages, work which generates heterogeneous and rapidly evolving data in large quantities. More fieldwork data formats, including interlinear text formats and lexicon interchange formats, could be supported in NLTK, helping linguists to curate and analyze this data, while liberating them to spend as much time as possible on data elicitation.
Other languages
Improved support for NLP in languages other than English could involve work in two areas: obtaining permission to distribute more corpora with NLTK’s data collection; and writing language-specific HOWTOs for posting at http://www.nltk.org/howto, illustrating the use of NLTK and discussing language-specific problems for NLP, including character encodings, word segmentation, and morphology. NLP researchers with expertise in a particular language could arrange to translate this book and host a copy on the NLTK website; this would go beyond translating the discussions to providing equivalent worked examples using data in the target language, a non-trivial undertaking.
NLTK-Contrib
Many of NLTK’s core components were contributed by members of the NLP community, and were initially housed in NLTK’s “Contrib” package, nltk_contrib. The only requirement for software to be added to this package is that it must be written in Python, relevant to NLP, and given the same open source license as the rest of NLTK. Imperfect software is welcome, and will probably be improved over time by other members of the NLP community.
Teaching materials
Since the earliest days of NLTK development, teaching materials have accompanied the software, materials that have gradually expanded to fill this book, plus a substantial quantity of online materials as well. We hope that instructors who supplement these materials with presentation slides, problem sets, solution sets, and more detailed treatments of the topics we have covered will make them available, and will notify the authors so we can link them from http://www.nltk.org/. Of particular value are materials that help NLP become a mainstream course in the undergraduate programs of computer science and linguistics departments, or that make NLP accessible at the secondary level, where there is significant scope for including computational content in the language, literature, computer science, and information technology curricula.
Only a toolkit
As stated in the preface, NLTK is a toolkit, not a system. Many problems will be tackled with a combination of NLTK, Python, other Python libraries, and interfaces to external NLP tools and formats.
Linguists are sometimes asked how many languages they speak, and have to explain that this field actually concerns the study of abstract structures that are shared by languages, a study which is more profound and elusive than learning to speak as many languages as possible. Similarly, computer scientists are sometimes asked how many programming languages they know, and have to explain that computer science actually concerns the study of data structures and algorithms that can be implemented in any programming language, a study which is more profound and elusive than striving for fluency in as many programming languages as possible.
This book has covered many topics in the field of Natural Language Processing. Most of the examples have used Python and English. However, it would be unfortunate if readers concluded that NLP is about how to write Python programs to manipulate English text, or more broadly, about how to write programs (in any programming language) to manipulate text (in any natural language). Our selection of Python and English was expedient, nothing more. Even our focus on programming itself was only a means to an end: as a way to understand data structures and algorithms for representing and manipulating collections of linguistically annotated text, as a way to build new language technologies to better serve the needs of the information society, and ultimately as a pathway into deeper understanding of the vast riches of human language.
But for the present: happy hacking!
Bibliography
[Abney, 1989] Steven P. Abney. A computational model of human parsing. Journal of Psycholinguistic Research, 18:129–144, 1989.
[Abney, 1991] Steven P. Abney. Parsing by chunks. In Robert C. Berwick, Steven P. Abney, and Carol Tenny, editors, Principle-Based Parsing: Computation and Psycholinguistics, volume 44 of Studies in Linguistics and Philosophy. Kluwer Academic Publishers, Dordrecht, 1991.
[Abney, 1996a] Steven Abney. Part-of-speech tagging and partial parsing. In Ken Church, Steve Young, and Gerrit Bloothooft, editors, Corpus-Based Methods in Language and Speech. Kluwer Academic Publishers, Dordrecht, 1996.
[Abney, 1996b] Steven Abney. Statistical methods and linguistics. In Judith Klavans and Philip Resnik, editors, The Balancing Act: Combining Symbolic and Statistical Approaches to Language. MIT Press, 1996.
[Abney, 2008] Steven Abney. Semisupervised Learning for Computational Linguistics. Chapman and Hall, 2008.
[Agirre and Edmonds, 2007] Eneko Agirre and Philip Edmonds. Word Sense Disambiguation: Algorithms and Applications. Springer, 2007.
[Alpaydin, 2004] Ethem Alpaydin. Introduction to Machine Learning. MIT Press, 2004.
[Ananiadou and McNaught, 2006] Sophia Ananiadou and John McNaught, editors. Text Mining for Biology and Biomedicine. Artech House, 2006.
[Androutsopoulos et al., 1995] Ion Androutsopoulos, Graeme Ritchie, and Peter Thanisch. Natural language interfaces to databases - an introduction. Journal of Natural Language Engineering, 1:29–81, 1995.
[Artstein and Poesio, 2008] Ron Artstein and Massimo Poesio. Inter-coder agreement for computational linguistics. Computational Linguistics, pages 555–596, 2008.
[Baayen, 2008] Harald Baayen. Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge University Press, 2008.
[Bachenko and Fitzpatrick, 1990] J. Bachenko and E. Fitzpatrick. A computational grammar of discourse-neutral prosodic phrasing in English. Computational Linguistics, 16:155–170, 1990.
[Baldwin and Kim, 2010] Timothy Baldwin and Su Nam Kim. Multiword Expressions. In Nitin Indurkhya and Fred J. Damerau, editors, Handbook of Natural Language Processing, second edition. Morgan and Claypool, 2010.
[Beazley, 2006] David M. Beazley. Python Essential Reference. Developer’s Library. Sams Publishing, third edition, 2006.
[Biber et al., 1998] Douglas Biber, Susan Conrad, and Randi Reppen. Corpus Linguistics: Investigating Language Structure and Use. Cambridge University Press, 1998.
[Bird, 1999] Steven Bird. Multidimensional exploration of online linguistic field data. In Pius Tamanji, Masako Hirotani, and Nancy Hall, editors, Proceedings of the 29th Annual Meeting of the Northeast Linguistics Society, pages 33–47. GLSA, University of Massachusetts at Amherst, 1999.
[Bird and Liberman, 2001] Steven Bird and Mark Liberman. A formal framework for linguistic annotation. Speech Communication, 33:23–60, 2001.
[Bird and Simons, 2003] Steven Bird and Gary Simons. Seven dimensions of portability for language documentation and description. Language, 79:557–582, 2003.
[Blackburn and Bos, 2005] Patrick Blackburn and Johan Bos. Representation and Inference for Natural Language: A First Course in Computational Semantics. CSLI Publications, Stanford, CA, 2005.
[BNC, 1999] BNC. British National Corpus, 1999. [http://info.ox.ac.uk/bnc/].
[Brent and Cartwright, 1995] Michael Brent and Timothy Cartwright. Distributional regularity and phonotactic constraints are useful for segmentation. In Michael Brent, editor, Computational Approaches to Language Acquisition. MIT Press, 1995.
[Bresnan and Hay, 2006] Joan Bresnan and Jennifer Hay. Gradient grammar: An effect of animacy on the syntax of give in New Zealand and American English. Lingua, 118:254–59, 2008.
[Budanitsky and Hirst, 2006] Alexander Budanitsky and Graeme Hirst. Evaluating wordnet-based measures of lexical semantic relatedness. Computational Linguistics, 32:13–48, 2006.
[Burton-Roberts, 1997] Noel Burton-Roberts. Analysing Sentences. Longman, 1997.
[Buseman et al., 1996] Alan Buseman, Karen Buseman, and Rod Early. The Linguist’s Shoebox: Integrated Data Management and Analysis for the Field Linguist. Waxhaw NC: SIL, 1996.
[Carpenter, 1992] Bob Carpenter. The Logic of Typed Feature Structures. Cambridge University Press, 1992.
[Carpenter, 1997] Bob Carpenter. Type-Logical Semantics. MIT Press, 1997.
[Chierchia and McConnell-Ginet, 1990] Gennaro Chierchia and Sally McConnell-Ginet. Meaning and Grammar: An Introduction to Meaning. MIT Press, Cambridge, MA, 1990.
[Chomsky, 1965] Noam Chomsky. Aspects of the Theory of Syntax. MIT Press, Cambridge, MA, 1965.
[Chomsky, 1970] Noam Chomsky. Remarks on nominalization. In R. Jacobs and P. Rosenbaum, editors, Readings in English Transformational Grammar. Blaisdell, Waltham, MA, 1970.
[Chomsky and Halle, 1968] Noam Chomsky and Morris Halle. The Sound Pattern of English. New York: Harper and Row, 1968.
[Church and Patil, 1982] Kenneth Church and Ramesh Patil. Coping with syntactic ambiguity or how to put the block in the box on the table. American Journal of Computational Linguistics, 8:139–149, 1982.
[Cohen and Hunter, 2004] K. Bretonnel Cohen and Lawrence Hunter. Natural language processing and systems biology. In Werner Dubitzky and Francisco Azuaje, editors, Artificial Intelligence Methods and Tools for Systems Biology, pages 147–174. Springer Verlag, 2004.
[Cole, 1997] Ronald Cole, editor. Survey of the State of the Art in Human Language Technology. Studies in Natural Language Processing. Cambridge University Press, 1997.
[Copestake, 2002] Ann Copestake. Implementing Typed Feature Structure Grammars. CSLI Publications, Stanford, CA, 2002.
[Corbett, 2006] Greville G. Corbett. Agreement. Cambridge University Press, 2006.
[Croft et al., 2009] Bruce Croft, Donald Metzler, and Trevor Strohman. Search Engines: Information Retrieval in Practice. Addison Wesley, 2009.
[Daelemans and van den Bosch, 2005] Walter Daelemans and Antal van den Bosch. Memory-Based Language Processing. Cambridge University Press, 2005.
[Dagan et al., 2006] Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In J. Quinonero-Candela, I. Dagan, B. Magnini, and F. d’Alché Buc, editors, Machine Learning Challenges, volume 3944 of Lecture Notes in Computer Science, pages 177–190. Springer, 2006.
[Dale et al., 2000] Robert Dale, Hermann Moisl, and Harold Somers, editors. Handbook of Natural Language Processing. Marcel Dekker, 2000.
[Dalrymple, 2001] Mary Dalrymple. Lexical Functional Grammar, volume 34 of Syntax and Semantics. Academic Press, New York, 2001.
[Dalrymple et al., 1999] Mary Dalrymple, V. Gupta, John Lamping, and V. Saraswat. Relating resource-based semantics to categorial semantics. In Mary Dalrymple, editor, Semantics and Syntax in Lexical Functional Grammar: The Resource Logic Approach, pages 261–280. MIT Press, Cambridge, MA, 1999.
[Dowty et al., 1981] David R. Dowty, Robert E. Wall, and Stanley Peters. Introduction to Montague Semantics. Kluwer Academic Publishers, 1981.
[Earley, 1970] Jay Earley. An efficient context-free parsing algorithm. Communications of the Association for Computing Machinery, 13:94–102, 1970.
[Emele and Zajac, 1990] Martin C. Emele and Rémi Zajac. Typed unification grammars. In Proceedings of the 13th Conference on Computational Linguistics, pages 293–298. Association for Computational Linguistics, Morristown, NJ, 1990.
[Farghaly, 2003] Ali Farghaly, editor. Handbook for Language Engineers. CSLI Publications, Stanford, CA, 2003.
[Feldman and Sanger, 2007] Ronen Feldman and James Sanger. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, 2007.
[Forsyth and Martell, 2007] Eric N. Forsyth and Craig H. Martell. Lexical and discourse analysis of online chat dialog. In Proceedings of the First IEEE International Conference on Semantic Computing, pages 19–26, 2007.
[Friedl, 2002] Jeffrey E. F. Friedl. Mastering Regular Expressions. O’Reilly, second edition, 2002.
[Garofolo et al., 1986] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathon G. Fiscus, David S. Pallett, and Nancy L. Dahlgren. The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CDROM. NIST, 1986.
[Gazdar et al., 1985] Gerald Gazdar, Ewan Klein, Geoffrey Pullum, and Ivan Sag. Generalized Phrase Structure Grammar. Basil Blackwell, 1985.
[Gomes et al., 2006] Bruce Gomes, William Hayes, and Raf Podowski. Text mining. In Darryl Leon and Scott Markel, editors, In Silico Technologies in Drug Target Identification and Validation. Taylor & Francis, 2006.
Trang 16[Gries, 2009] Stefan Gries Quantitative Corpus Linguistics with R: A Practical duction Routledge, 2009.
Intro-[Guzdial, 2005] Mark Guzdial Introduction to Computing and Programming in Python:
A Multimedia Approach Prentice Hall, 2005.
[Harel, 2004] David Harel Algorithmics: The Spirit of Computing Addison Wesley,
2004
[Hastie et al., 2009] Trevor Hastie, Robert Tibshirani, and Jerome Friedman The ements of Statistical Learning: Data Mining, Inference, and Prediction Springer, second
El-edition, 2009
[Hearst, 1992] Marti Hearst Automatic acquisition of hyponyms from large text
cor-pora In Proceedings of the 14th Conference on Computational Linguistics (COLING),
[Hodges, 1977] Wilfred Hodges Logic Penguin Books, Harmondsworth, 1977 [Huddleston and Pullum, 2002] Rodney D Huddleston and Geoffrey K Pullum The Cambridge Grammar of the English Language Cambridge University Press, 2002 [Hunt and Thomas, 2000] Andrew Hunt and David Thomas The Pragmatic Program- mer: From Journeyman to Master Addison Wesley, 2000.
[Indurkhya and Damerau, 2010] Nitin Indurkhya and Fred Damerau, editors book of Natural Language Processing CRC Press, Taylor and Francis Group, second
Hand-edition, 2010
[Jackendoff, 1977] Ray Jackendoff X-Syntax: a Study of Phrase Strucure Number 2 in
Linguistic Inquiry Monograph MIT Press, Cambridge, MA, 1977
[Johnson, 1988] Mark Johnson Attribute Value Logic and Theory of Grammar CSLI
Lecture Notes Series University of Chicago Press, 1988
[Jurafsky and Martin, 2008] Daniel Jurafsky and James H Martin Speech and Language Processing Prentice Hall, second edition, 2008.
[Kamp and Reyle, 1993] Hans Kamp and Uwe Reyle From Discourse to the Lexicon: Introduction to Modeltheoretic Semantics of Natural Language, Formal Logic and Dis- course Representation Theory Kluwer Academic Publishers, 1993.
[Kaplan, 1989] Ronald Kaplan. The formal architecture of lexical-functional grammar. In Chu-Ren Huang and Keh-Jiann Chen, editors, Proceedings of ROCLING II, pages 1–18. CSLI, 1989. Reprinted in Dalrymple, Kaplan, Maxwell, and Zaenen (eds), Formal Issues in Lexical-Functional Grammar, pages 7–27. CSLI Publications, Stanford, CA,
[Kasper and Rounds, 1986] Robert T. Kasper and William C. Rounds. A logical semantics for feature structures. In Proceedings of the 24th Annual Meeting of the Association for Computational Linguistics, pages 257–266. Association for Computational Linguistics, 1986.
[Kathol, 1999] Andreas Kathol. Agreement and the syntax-morphology interface in HPSG. In Robert D. Levine and Georgia M. Green, editors, Studies in Contemporary Phrase Structure Grammar, pages 223–274. Cambridge University Press, 1999.
[Kay, 1985] Martin Kay. Unification in grammar. In Verónica Dahl and Patrick Saint-Dizier, editors, Natural Language Understanding and Logic Programming, pages 233–240. North-Holland, 1985. Proceedings of the First International Workshop on Natural Language Understanding and Logic Programming.
[Kiss and Strunk, 2006] Tibor Kiss and Jan Strunk. Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32:485–525, 2006.
[Kiusalaas, 2005] Jaan Kiusalaas. Numerical Methods in Engineering with Python. Cambridge University Press, 2005.
[Klein and Manning, 2003] Dan Klein and Christopher D. Manning. A* parsing: Fast exact Viterbi parse selection. In Proceedings of HLT-NAACL 03, 2003.
[Knuth, 2006] Donald E. Knuth. The Art of Computer Programming, Volume 4: Generating All Trees. Addison Wesley, 2006.
[Lappin, 1996] Shalom Lappin, editor. The Handbook of Contemporary Semantic Theory. Blackwell Publishers, Oxford, 1996.
[Larson and Segal, 1995] Richard Larson and Gabriel Segal. Knowledge of Meaning: An Introduction to Semantic Theory. MIT Press, Cambridge, MA, 1995.
[Levin, 1993] Beth Levin. English Verb Classes and Alternations. University of Chicago
[Madnani, 2007] Nitin Madnani. Getting started on natural language processing with Python. ACM Crossroads, 13(4), 2007.
[Manning, 2003] Christopher Manning. Probabilistic syntax. In Probabilistic Linguistics, pages 289–341. MIT Press, Cambridge, MA, 2003.
[Manning and Schütze, 1999] Christopher Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999.
[Manning et al., 2008] Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[McCawley, 1998] James McCawley. The Syntactic Phenomena of English. University
[Miller and Charles, 1998] George Miller and Walter Charles. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6:1–28, 1998.
[Mitkov, 2002a] Ruslan Mitkov. Anaphora Resolution. Longman, 2002.
[Mitkov, 2002b] Ruslan Mitkov, editor. Oxford Handbook of Computational Linguistics. Oxford University Press, 2002.
[Müller, 2002] Stefan Müller. Complex Predicates: Verbal Complexes, Resultative Constructions, and Particle Verbs in German. Number 13 in Studies in Constraint-Based Lexicalism. Center for the Study of Language and Information, Stanford, 2002. http://www.dfki.de/~stefan/Pub/complex.html
[Nerbonne et al., 1994] John Nerbonne, Klaus Netter, and Carl Pollard. German in Head-Driven Phrase Structure Grammar. CSLI Publications, Stanford, CA, 1994.
[Nespor and Vogel, 1986] Marina Nespor and Irene Vogel. Prosodic Phonology. Number 28 in Studies in Generative Grammar. Foris Publications, Dordrecht, 1986.
[Nivre et al., 2006] J. Nivre, J. Hall, and J. Nilsson. Maltparser: A data-driven parser-generator for dependency parsing. In Proceedings of LREC, pages 2216–2219, 2006.
[Niyogi, 2006] Partha Niyogi. The Computational Nature of Language Learning and Evolution. MIT Press, 2006.
[O’Grady et al., 2004] William O’Grady, John Archibald, Mark Aronoff, and Janie Rees-Miller. Contemporary Linguistics: An Introduction. St. Martin’s Press, fifth edition, 2004.
[OSU, 2007] OSU, editor. Language Files: Materials for an Introduction to Language and Linguistics. Ohio State University Press, tenth edition, 2007.
[Partee, 1995] Barbara Partee. Lexical semantics and compositionality. In L. R. Gleitman and M. Liberman, editors, An Invitation to Cognitive Science: Language, volume 1, pages 311–360. MIT Press, 1995.
[Pasca, 2003] Marius Pasca. Open-Domain Question Answering from Large Text Collections. CSLI Publications, Stanford, CA, 2003.
[Pevzner and Hearst, 2002] L. Pevzner and M. Hearst. A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, 28:19–36, 2002.
[Pullum, 2005] Geoffrey K. Pullum. Fossilized prejudices about “however”, 2005.
[Radford, 1988] Andrew Radford. Transformational Grammar: An Introduction. Cambridge University Press, 1988.
[Ramshaw and Marcus, 1995] Lance A. Ramshaw and Mitchell P. Marcus. Text chunking using transformation-based learning. In Proceedings of the Third ACL Workshop on Very Large Corpora, pages 82–94, 1995.
[Reppen et al., 2005] Randi Reppen, Nancy Ide, and Keith Suderman. American National Corpus. Linguistic Data Consortium, 2005.
[Robinson et al., 2007] Stuart Robinson, Greg Aumann, and Steven Bird. Managing fieldwork data with Toolbox and the Natural Language Toolkit. Language Documentation and Conservation, 1:44–57, 2007.
[Sag and Wasow, 1999] Ivan A. Sag and Thomas Wasow. Syntactic Theory: A Formal Introduction. CSLI Publications, Stanford, CA, 1999.
[Sampson and McCarthy, 2005] Geoffrey Sampson and Diana McCarthy. Corpus Linguistics: Readings in a Widening Discipline. Continuum, 2005.
[Scott and Tribble, 2006] Mike Scott and Christopher Tribble. Textual Patterns: Key Words and Corpus Analysis in Language Education. John Benjamins, 2006.
[Segaran, 2007] Toby Segaran. Collective Intelligence. O’Reilly Media, 2007.
[Shatkay and Feldman, 2004] Hagit Shatkay and R. Feldman. Mining the biomedical literature in the genomic era: An overview. Journal of Computational Biology, 10:821–855, 2004.
[Shieber, 1986] Stuart M. Shieber. An Introduction to Unification-Based Approaches to Grammar, volume 4 of CSLI Lecture Notes Series. CSLI Publications, Stanford, CA, 1986.
[Shieber et al., 1983] Stuart Shieber, Hans Uszkoreit, Fernando Pereira, Jane Robinson, and Mabry Tyson. The formalism and implementation of PATR-II. In Barbara J. Grosz and Mark Stickel, editors, Research on Interactive Acquisition and Use of Knowledge, techreport 4, pages 39–79. SRI International, Menlo Park, CA, November 1983. (http://www.eecs.harvard.edu/~shieber/Biblio/Papers/Shieber-83-FIP.pdf)
[Simons and Bird, 2003] Gary Simons and Steven Bird. The Open Language Archives Community: An infrastructure for distributed archiving of language resources. Literary and Linguistic Computing, 18:117–128, 2003.
[Sproat et al., 2001] Richard Sproat, Alan Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards. Normalization of non-standard words. Computer Speech and Language, 15:287–333, 2001.
[Strunk and White, 1999] William Strunk and E. B. White. The Elements of Style. Boston, Allyn and Bacon, 1999.
[Thompson and McKelvie, 1997] Henry S. Thompson and David McKelvie. Hyperlink semantics for standoff markup of read-only documents. In SGML Europe ’97, 1997. http://www.ltg.ed.ac.uk/~ht/sgmleu97.html
[TLG, 1999] TLG. Thesaurus Linguae Graecae, 1999.
[Turing, 1950] Alan M. Turing. Computing machinery and intelligence. Mind, 59(236):433–460, 1950.
[van Benthem and ter Meulen, 1997] Johan van Benthem and Alice ter Meulen, editors. Handbook of Logic and Language. MIT Press, Cambridge, MA, 1997.
[van Rossum and Drake, 2006a] Guido van Rossum and Fred L. Drake. An Introduction to Python—The Python Tutorial. Network Theory Ltd, Bristol, 2006.
[van Rossum and Drake, 2006b] Guido van Rossum and Fred L. Drake. The Python Language Reference Manual. Network Theory Ltd, Bristol, 2006.
[Warren and Pereira, 1982] David H. D. Warren and Fernando C. N. Pereira. An efficient easily adaptable system for interpreting natural language queries. American Journal of Computational Linguistics, 8(3-4):110–122, 1982.
[Wechsler and Zlatic, 2003] Stephen Mark Wechsler and Larisa Zlatic. The Many Faces of Agreement. Stanford Monographs in Linguistics. CSLI Publications, Stanford, CA, 2003.
[Weiss et al., 2004] Sholom Weiss, Nitin Indurkhya, Tong Zhang, and Fred Damerau. Text Mining: Predictive Methods for Analyzing Unstructured Information. Springer, 2004.
[Woods et al., 1986] Anthony Woods, Paul Fletcher, and Arthur Hughes. Statistics in Language Studies. Cambridge University Press, 1986.
[Zhao and Zobel, 2007] Y. Zhao and J. Zobel. Search with style: Authorship attribution in classic literature. In Proceedings of the Thirtieth Australasian Computer Science Conference. Association for Computing Machinery, 2007.