Computational Methods for Corpus Annotation and Analysis

Xiaofei Lu
ISBN 978-94-017-8644-7    ISBN 978-94-017-8645-4 (eBook)
DOI 10.1007/978-94-017-8645-4
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2014931404
© Springer Science+Business Media Dordrecht 2014
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Xiaofei Lu
Department of Applied Linguistics
The Pennsylvania State University
University Park
Pennsylvania
USA
Dedicated to my wife, Xiaomeng, and our daughter, Jasmine.
Preface
This book grew out of sets of lecture notes for a graduate course on computational methods for corpus annotation and analysis that I have taught in the Department of Applied Linguistics at The Pennsylvania State University since 2006. After several iterations of the course, my students and I realized that while there is an abundance of introductory sources on the fundamentals of corpus linguistics, most of them do not provide the types of detailed and systematic instructions that are necessary to help language and linguistics researchers get off the ground with using computational tools other than concordancing programs for automatic corpus annotation and analysis. A large proportion of the students taking the course were not yet ready to embark on learning to program, and to them the introductory sources on programming for linguistics, natural language processing, and computational linguistics appeared overwhelming. What seemed to be lacking was something in the middle ground, something that enables novice language and linguistics researchers to use more sophisticated and powerful corpus annotation and analysis tools than concordancing programs and yet still does not require programming. This book was written with the aim to provide that middle ground.
I owe a special thanks to all the students who have taken the course with me at The Pennsylvania State University. This book could not have been written without their inspiration. In particular, I want to thank Brody Bluemel and Ben Pin-Yun Wang for providing very detailed feedback on earlier drafts of several chapters; Edie Furniss, Qianwen Li, and many others for pointing me to various stylistic issues in the book; Haiyang Ai, Brody Bluemel, Tracy Davis, Alissa Hartig, Shibamouli Lahiri, Kwanghyun Park, Jie Zhang, and Xian Zhang for numerous discussions about the lecture notes that the book grew out of while taking and/or co-teaching the course with me.
It would be difficult to thank all the people who have influenced the ideas behind this book. I am deeply indebted to Detmar Meurers and Chris Brew, who first introduced me to the field of computational linguistics. I have also learned tremendously from a large number of other colleagues, directly or indirectly. To name just a few: Gabriella Appel, Stacey Bailey, Douglas Biber, Donna Byron, Marjorie Chan, Richard Crouch, Markus Dickinson, Nick Ellis, Anna Feldman, Eric Fosler-Lussier, ZhaoHong Han, Jirka Hana, Erhard Hinrichs, Tracy Holloway King, James Lantolf,
Contents

1 Introduction
1.1 Objectives and Rationale of the Book
1.2 Why Do We Need to Go Beyond Raw Corpora
1.3 What Is Corpus Annotation
1.4 Organization of the Book
References

2 Text Processing with the Command Line Interface
2.1 The Command Line Interface
2.2 Basic Commands
2.2.1 Notational Conventions
2.2.2 Printing the Current Working Directory
2.2.3 Listing Files and Subdirectories
2.2.4 Making New Directories
2.2.5 Changing Directory Locations
2.2.6 Creating and Editing Text Files with UTF-8 Encoding
2.2.7 Viewing, Renaming, Moving, Copying, and Removing Files
2.2.8 Copying, Moving, and Removing Directories
2.2.9 Using Shell Meta-Characters for File Matching
2.2.10 Manual Pages, Command History, and Command Line Completion
2.3 Tools for Text Processing
2.3.1 Searching for a String with egrep
2.3.2 Regular Expressions
2.3.3 Character Translation with tr
2.3.4 Editing Files from the Command Line with sed
2.3.5 Data Filtering and Manipulation Using awk
2.3.6 Task Decomposition and Pipes
2.4 Summary
References

3 Lexical Annotation
3.1 Part-of-Speech Tagging
3.1.1 What is Part-of-Speech Tagging
3.1.2 Understanding Part-of-Speech Tagsets
3.1.3 The Stanford Part-of-Speech Tagger
3.2 Lemmatization
3.2.1 What is Lemmatization and Why is it Useful
3.2.2 The TreeTagger
3.3 Additional Tools
3.3.1 The Stanford Tokenizer
3.3.2 The Stanford Word Segmenter for Arabic and Chinese
3.3.3 The CLAWS Tagger for English
3.3.4 The Morpha Lemmatizer for English
3.4 Summary
References

4 Lexical Analysis
4.1 Frequency Lists
4.1.1 Working with Output Files from the TreeTagger
4.1.2 Working with Output Files from the Stanford POS Tagger and Morpha
4.1.3 Analyzing Frequency Lists with Text Processing Tools
4.2 N-Grams
4.3 Lexical Richness
4.3.1 Lexical Density
4.3.2 Lexical Variation
4.3.3 Lexical Sophistication
4.3.4 Tools for Lexical Richness Analysis
4.4 Summary
References

5 Syntactic Annotation
5.1 Syntactic Parsing Overview
5.1.1 What is Syntactic Parsing and Why is it Useful?
5.1.2 Phrase Structure Grammars
5.1.3 Dependency Grammars
5.2 Syntactic Parsers
5.2.1 The Stanford Parser
5.2.2 Collins' Parser
5.3 Summary
References

6 Syntactic Analysis
6.1 Querying Syntactically Parsed Corpora
6.1.1 Tree Relationships
6.1.2 Tregex
6.2 Syntactic Complexity Analysis
6.2.1 Measures of Syntactic Complexity
6.2.2 Syntactic Complexity Analyzers
6.3 Summary
References

7 Semantic, Pragmatic and Discourse Analysis
7.1 Semantic Field Analysis
7.1.1 The UCREL Semantic Analysis System
7.1.2 Profile in Semantics-Lexical in Computerized Profiling
7.2 Analysis of Propositions
7.2.1 Computerized Propositional Idea Density Rater
7.2.2 Analysis of Propositions in Computerized Profiling
7.3 Conversational Act Analysis in Computerized Profiling
7.4 Coherence and Cohesion Analysis in Coh-Metrix
7.4.1 Referential Cohesion Features
7.4.2 Features Based on Latent Semantic Analysis
7.4.3 Features Based on Connectives
7.4.4 Situation Model Features
7.4.5 Word Information Features
7.5 Text Structure Analysis
7.6 Summary
References

8 Summary and Outlook
8.1 Summary of the Book
8.2 Future Directions in Computational Corpus Analysis
8.2.1 Computational Analysis of Language Meaning and Use
8.2.2 Computational Analysis of Learner Language
8.2.3 Computational Analysis Based on Specific Language Theories
References

Appendix
Chapter 1
Introduction
Abstract This introductory chapter provides a brief overview of the objectives and rationale of the book, the need for corpus annotation, the key concepts and issues involved in corpus annotation and in using annotated corpora for corpus linguistics research, and the organization of the book.
1.1 Objectives and Rationale of the Book
The primary goal of this book is to provide a systematic and accessible introduction to the state-of-the-art computational systems and software programs that can be used to automate or semi-automate the annotation and analysis of text corpora at diverse linguistic levels. This is not intended to be yet another introductory book on corpus linguistics that walks you through the definition of corpus, the history of corpus linguistics, the principles of corpus design and compilation, the myriad of corpora that are freely or commercially available, the types of word frequency, collocational, phraseological, and lexico-grammatical analysis you can perform on unannotated or raw corpora with concordancing programs, or the various ways in which such analyses have been used in previous corpus linguistics research. Needless to say, these topics and issues are of fundamental importance and should be an integral part of any systematic training on corpus linguistics one might receive. For this very reason, they have been and continue to be extensively discussed in numerous other introductory sources (e.g., Biber et al. 1998; Hunston 2002; Kennedy 1998; Lüdeling and Kytö 2009; McEnery et al. 2006; McEnery and Hardie 2011; O'Keeffe and McCarthy 2010; Teubert and Čermáková 2004). In this book, however, I will set aside these topics, assuming that you either already have some familiarity with them or will be reading this book along with sources that introduce you to them. Instead, I will bring to the spotlight the issue of corpus annotation and the analysis of annotated corpora using computational tools.
Most linguistics researchers who use corpora in their research in one way or another are probably not complete strangers to the idea of corpus annotation and its usefulness. However, to many researchers, and especially those who are relatively new to the field of corpus linguistics, this is also one of the most challenging and to some extent intimidating aspects of doing corpus linguistics, due to the amount of seemingly sophisticated computational processing involved. Whereas it is not uncommon for corpus linguistics sources to have some coverage of corpus annotation, most of them do not go beyond a general description of the types of corpus annotation possible, an illustration of the schemes and formats used in those types of annotation, a discussion of good practice in manual annotation, and a brief mention of some of the computational tools that are available. Few sources provide adequate details for a novice researcher to acquire the necessary skills to automate different types of linguistic annotation on large-scale corpus data and to effectively query annotated corpora. Although there is an abundance of discussion on automatic annotation and analysis of large text corpora in the computational linguistics and natural language processing literature, the focus there is generally on the details of the computational and mathematical algorithms used to realize and optimize such annotation and analysis (e.g., Jurafsky and Martin 2008; Manning and Schütze 1999; Roark and Sproat 2007). For many linguistics researchers, such details are usually neither directly relevant nor easily comprehensible. Recognizing the critical importance of corpus annotation in corpus linguistics research, I have undertaken to write this book to address the lack of a systematic, in-depth, and hands-on treatment of this issue. I will have relatively little to say about good practice in manual annotation, but will focus on making the computational processes involved in automatic corpus annotation and analysis accessible.

It is important to be aware that computational tools for corpus annotation and analysis are available through diverse types of user interface, including graphical user interfaces, web-based interfaces, and command line interfaces. To take the best advantage of the computational tools that are available, one should not restrict oneself to a particular type of user interface. Instead, it is desirable to be able to utilize different tools through different types of user interface to meet different kinds of analytical needs. If you have used concordancing programs such as WordSmith Tools (Scott 2012) or AntConc (Anthony 2010) and searched online corpora such as the Corpus of Contemporary American English (COCA, Davies 2008), then you are already familiar with graphical user interfaces and web-based interfaces. The command line interface is used more commonly in UNIX and UNIX-like operating systems such as Linux and Mac OS X and tends to be less familiar to language and linguistics researchers than graphical user interfaces and web-based interfaces. However, many powerful corpus annotation and analysis tools are available through the command line interface only or work more efficiently in the command line interface than in web-based or graphical user interfaces.
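To give a first feel for the command line tools that Chap. 2 introduces, here is a minimal sketch of a word frequency list built with a pipeline. The file sample.txt and its two-line contents are invented for illustration; any plain-text corpus file could take their place.

```shell
# Create a tiny two-line toy corpus (any plain-text file works here).
printf 'the cat sat on the mat\nthe dog sat\n' > sample.txt

# Lowercase everything, put one word per line, then sort and count:
tr 'A-Z' 'a-z' < sample.txt | tr -s ' ' '\n' | sort | uniq -c | sort -rn
```

The most frequent word ("the", three occurrences) comes out on top. Each stage does one small job, an instance of the task decomposition and pipes discussed in Chap. 2.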
The computational tools introduced in this book are not organized by type of user interface, but by level of linguistic annotation or analysis. For each level of linguistic annotation and analysis, our goal is to make accessible an adequate set of state-of-the-art computational tools, which may be available through different types of user interface. As tools with graphical user interfaces or web-based interfaces are usually intuitive to learn and use, our discussion of these tools will be relatively brief, focusing primarily on the types of linguistic annotation and analysis they facilitate. However, we do not assume any prior experience with the command line interface but provide step-by-step instructions for all tools that are accessed through this interface. If you are already familiar with the command line interface, you may skip these instructions and focus on the discussion of the functionalities of the corpus annotation and analysis tools instead.
It should become clear in later chapters of the book that much can be achieved with the computational tools covered here. As you become more experienced with diverse types of corpus annotation and analysis tasks, you will likely also realize that the ability to write your own scripts to format, process and analyze raw and annotated texts gives you significantly more analytical flexibility and power. Indeed, for advanced corpus linguistics researchers, it is desirable to have good command of both the state-of-the-art corpus annotation and analysis tools and one or more programming languages. These two sets of knowledge are complementary to and facilitate the acquisition of each other. When you are ready to learn a programming language, you may wish to refer to one or more of a number of books that have been written specifically to introduce linguistics researchers to scripting languages such as Python (e.g., Bird et al. 2009; Downey 2012; Perkins 2010), Perl (e.g., Schwartz et al. 2005; Weisser 2010), and R (e.g., Gries 2009).
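As a small taste of that flexibility, the sketch below runs a one-line awk script (awk is covered in Chap. 2) over an invented three-token file laid out in the tab-separated word, tag, lemma format that the TreeTagger produces; the file name and its contents are illustrative only.

```shell
# A tiny invented file in word<TAB>tag<TAB>lemma format.
printf 'The\tDT\tthe\ncats\tNNS\tcat\nsleep\tVVP\tsleep\n' > tagged.tsv

# Print the lemma (field 3) of every plural noun (tag NNS in field 2):
awk -F'\t' '$2 == "NNS" { print $3 }' tagged.tsv
```

No concordancer menu offers exactly this query, yet the script takes seconds to write and adapts to any column-based annotation format.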
In the rest of this chapter, I will first motivate the enterprise of corpus annotation by illustrating the limitations of working with raw corpora and the additional insights one could gain by working with corpora annotated with different levels of linguistic information. I will then provide a brief account of the key concepts and issues involved in corpus annotation and in using annotated corpora for corpus linguistics research. Finally, I will close the chapter with a description of the organization of the book.
1.2 Why Do We Need to Go Beyond Raw Corpora
In linguistics research, corpora have been commonly used in the following two ways. First, with the increasing availability of large-scale corpora that are searchable online, many studies have examined various linguistic phenomena by querying online corpora through their built-in search interfaces. Some examples of oft-used online corpora that have served as the basis of many recent corpus linguistics studies include the Corpus of Contemporary American English (COCA, Davies 2008), the Corpus of Historical American English (COHA, Davies 2010), the Michigan Corpus of Academic Spoken English (MICASE, Simpson et al. 2002), the Academia Sinica Balanced Corpus of Modern Chinese (Sinica Corpus, Chen et al. 1996), and the Hong Kong Corpus of Spoken English (HKCSE, Cheng et al. 2008), to name just a few. While many online corpora contain raw data only (e.g., MICASE and HKCSE), others provide part-of-speech (POS) or morphologically annotated data (e.g., COCA, COHA, and Sinica Corpus). In the latter case, it is possible to incorporate POS or morphological information in one's corpus queries to obtain more accurate search results. For example, one may search for all occurrences of work used as a verb, including occurrences of its inflected forms (i.e., works, worked, and working). Large-scale online corpora annotated with other levels of linguistic information, such as syntactic and semantic information, exist as well, although they are relatively harder to find. The Russian National Corpus (RNC, Apresjan et al. 2006), in which words are tagged with both grammatical and semantic features, constitutes an excellent example of this type of corpus. Depending on its lexical category, a word in the RNC may be assigned a particular set of semantic tags. As an illustration, a verb may be assigned one of the following taxonomic classes: movement, placement, physical impact, change of state or property, sphere of being, location, or contact and support; furthermore, it may be tagged as a causative or a non-causative verb; in the case of an auxiliary verb, it may be tagged as a phasal verb or an auxiliary causative verb; finally, in the case of a derivational form, it may be tagged as a prefixal verb, a semelfactive, or a secondary imperfective. These semantic tags, in combination with grammatical tags that indicate a word's lexical category and, when applicable, its person, gender, and case, etc., facilitate fine-grained searches that allow one to retrieve occurrences of words with specific lexico-semantic characteristics. All in all, compared with their raw counterparts, online corpora that incorporate richer linguistic annotation prove useful in a much wider range of linguistic explorations.
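A quick sketch makes the gain concrete. Assuming a corpus tagged in the widely used word_TAG convention with Penn Treebank tags (the two sentences below are invented), a single egrep pattern retrieves verbal uses of work in all inflected forms while passing over the noun uses:

```shell
# An invented two-sentence corpus in word_TAG format.
printf 'She_PRP works_VBZ at_IN the_DT works_NNS ._.\nHard_JJ work_NN paid_VBD off_RP ._.\n' > tagged.txt

# Verbal "work" in any inflection: the verb tags are VB, VBD, VBG, VBN, VBP, VBZ.
egrep -o 'work(s|ed|ing)?_VB[DGNPZ]?' tagged.txt
```

Only works_VBZ matches; works_NNS and work_NN are correctly skipped, a distinction that no search over a raw corpus can draw reliably.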
While the value of online searchable corpora in corpus linguistics research cannot be overstated, researchers are also constrained by the types of data included in the corpora, the types of linguistic annotation that have been added to the data, and the functionalities of the built-in search interfaces. As such, it is often necessary for researchers to compile their own corpora, or, in some cases, to obtain the actual text files of publicly or commercially available corpora so that they can be analyzed with additional tools. Concordancing programs such as WordSmith Tools (Scott 2012) and AntConc (Anthony 2010) have proven to be extremely popular and useful as tools for querying and analyzing such corpora. Whereas some variation exists in the specific functionalities of different concordancing programs, most of them allow users to generate frequency lists of words and clusters (or n-grams), retrieve occurrences of a search word, phrase, or pattern and display them in a keyword in context (KWIC) format, analyze the collocates of the search word, and compare the words used in two different corpora. These functionalities have found applications in numerous studies for examining a wide range of linguistic phenomena. Needless to say, concordancing programs have played and will in all likelihood continue to play a very important role in corpus linguistics research. However, as is the case with unannotated corpora that are searchable online, raw corpora and concordancing programs are insufficient for all types of linguistic analysis one may wish to perform. When we deal with individual words in a raw corpus, we will again be forced to either ignore or manually differentiate between the multiple POS categories (e.g., work as a noun and a verb), meanings (e.g., gay meaning homosexual and happy), and forms (e.g., be, is, am, are, was, were, and been) a word may have. The ability to resolve ambiguities in the uses of the same words as well as to recognize their variant forms will undoubtedly enable more accurate and fine-grained lexical analyses. In addition, as we attempt to identify occurrences of a phrasal, clausal or sentential structure in a raw corpus, such as a noun phrase with a post-modifying prepositional phrase, we could at best specify a set of specific surface forms that can potentially realize the structure, search for these forms in the corpus, and then manually determine whether each of the retrieved instances indeed realizes the structure. This is obviously a laborious process. More seriously, it is usually very difficult to come up with the complete set of surface forms for a structure. The ability to analyze the phrasal and syntactic structure of the sentences in the corpus will make it possible to automatically retrieve all instances of a given structure.
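The surface-form workaround, and its limits, can be sketched with the forms of be discussed above. On a raw corpus, every inflection has to be enumerated by hand (the three-line file below is invented), and the pattern still reveals nothing about each token's grammatical role:

```shell
# An invented raw (unannotated) sample.
printf 'There is a book on the table.\nThe books were on sale.\nBeing early is better.\n' > raw.txt

# Every surface form of "be" must be listed explicitly:
egrep -o -w -i '(be|am|is|are|was|were|been|being)' raw.txt
```

The pattern finds is, were, Being, and is again, but only because all eight forms were remembered; for a structure such as a noun phrase with a post-modifying prepositional phrase, no such finite list of surface forms exists, which is where the parsed corpora of Chaps. 5 and 6 come in.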
1.3 What Is Corpus Annotation
Corpus annotation refers to the practice of adding linguistic information to a corpus of written or spoken language. The types of linguistic information that can be added to corpora are wide-ranging, including lexical, morphological, syntactic, semantic, pragmatic, and discoursal, among others. In the case of spoken corpora, one can also include phonetic and prosodic information. In this section, we will briefly discuss some key concepts and issues involved in corpus annotation and in using annotated corpora for corpus linguistics research. More detailed discussion of annotation at different linguistic levels will take place in Chaps. 3–8. The principles, practices, schemes and formats for corpus annotation are also discussed in most of the introductory sources and handbooks on corpus linguistics mentioned earlier (e.g., Biber et al. 1998; Hunston 2002; Kennedy 1998; Lüdeling and Kytö 2009; McEnery et al. 2006; McEnery and Hardie 2011; O'Keeffe and McCarthy 2010; Teubert and Čermáková 2007).
A critical part of any corpus annotation project, regardless of its type and scale, is the annotation scheme. First and foremost, the annotation scheme should contain explicit and complete information on the linguistic categories to be differentiated during the annotation process. These categories depend not only on the type of annotation at issue but also on the degree of specificity that is desired or required given the purposes of the annotation. For example, in the case of POS annotation, the annotation scheme should clearly define the POS categories that need to be differentiated. Whereas distinctions between general POS categories such as nouns, verbs, adverbs, and adjectives can be expected in any POS annotation scheme, more fine-grained distinctions between subtypes of specific POS categories (e.g., different types of nouns) often differ from scheme to scheme, as not all distinctions (e.g., whether a common noun is a temporal noun or not) are considered important or necessary in all projects. A second part of the annotation scheme is a set of labels (also variably referred to as tags, codes, or symbols) that are designed to denote the linguistic categories, with a one-to-one correspondence between the labels and the categories. These labels should be concise and intuitively meaningful. In addition, whenever distinctions are made between subcategories of a general category, the labeling system should be designed to capture the commonalities among the subcategories. For example, if proper nouns and common nouns are differentiated in a POS annotation scheme, the labels used to denote proper and common nouns should enable us to easily tell that they are both nouns. Finally, the annotation scheme should also contain a set of guidelines that explain how different linguistic units in the corpus are to be assigned to the linguistic categories defined in the scheme and subsequently annotated with appropriate labels. These guidelines are particularly important for resolving cases that may be ambiguous and that may not be consistently treated in different annotation schemes. For example, a POS annotation scheme needs to explain when past participles should be treated as adjectives and when they should be treated as verbs.
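This design pays off in practice. In the Penn Treebank tagset, for instance, all noun labels share the prefix NN (NN, NNS, NNP, NNPS), so a single pattern collects common and proper nouns alike; the tagged sentence below is invented:

```shell
# An invented sentence in word_TAG format with Penn Treebank tags.
printf 'John_NNP reads_VBZ two_CD papers_NNS each_DT week_NN ._.\n' > ptb.txt

# The shared NN prefix retrieves every noun subtype at once:
egrep -o '[^ ]+_NN[A-Z]*' ptb.txt
```

The pattern returns John_NNP, papers_NNS, and week_NN. Had the scheme labeled proper nouns with, say, an unrelated tag such as PROP, no such one-pattern query would be possible.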
The second critical consideration in a corpus annotation project is the format to be used in the annotation. This has to do with how the labels are to be applied to the appropriate linguistic units in the raw corpus. Regardless of the format adopted, it is recommended that the annotations be easily separable from the raw corpus. In other words, it should be possible for one to not only examine the raw texts and their annotations at the same time, but also separate the texts from their annotations and examine them individually. The Extensible Markup Language (XML) has increasingly become the markup language of choice in corpus annotation projects as well as computer programs designed to assist manual annotation (e.g., the UAM CorpusTool, O'Donnell 2008), as it facilitates the addition of multiple layers of annotation that are easily separable from the raw corpus and that can be reused and extended in later work. However, as will become obvious in the following chapters, many automatic corpus annotation programs do not adopt any particular markup language. Rather, they use variable output formats that fit their purposes. To take advantage of the linguistic annotations added to raw corpora, it is important to understand what linguistic information is provided as well as how such information is provided.

1.4 Organization of the Book
The primary objective of this book is to provide an accessible introduction to the types of linguistic annotation that can be reasonably accurately automated, the computational tools and software programs that can be used to automate the annotations, and the ways in which the annotated corpora can be effectively queried and analyzed. This introduction is intended for language and linguistics students and researchers who have been or are currently being exposed to some of the introductory sources mentioned above that cover the fundamentals of corpus linguistics, are familiar with using concordancing programs to analyze raw corpora, and are interested in learning more about using computational tools for annotating corpora and analyzing annotated corpora. Prior knowledge of the command line interface or programming will help you follow the discussion in this book, but is not assumed.

The rest of the book is organized as follows. Chapter 2, "Text Processing with the Command Line Interface", demystifies the command line interface that is commonly used in UNIX and UNIX-like systems (e.g., Linux and Mac OS X). It walks the reader through a set of basic commands with concrete examples and introduces a number of simple but robust command line interface tools for text processing.

Chapter 3, "Lexical Annotation", focuses on technology for automatic POS tagging and lemmatization. For POS tagging, it discusses the nature and applications of the task, describes the tagsets used in POS tagging, and provides a step-by-step tutorial on the usage of the Stanford POS Tagger (Toutanova et al. 2003). For lemmatization, it explains the definition and usefulness of the process and provides detailed instructions for using the TreeTagger (Schmid 1994, 1995), a tool for multilingual POS tagging and lemmatization. Several additional tools for tokenization, word segmentation, POS tagging, and lemmatization are also examined.

Chapter 4, "Lexical Analysis", exemplifies how POS tagged and lemmatized corpora can be employed effectively to enrich computational lexical analysis. Examples include generating and analyzing frequency lists with POS and lemma information, generating n-gram lists with POS and lemma information, and analysis of lexical density, variation and sophistication using a large set of measures proposed in the language acquisition literature.

Chapter 5, "Syntactic Annotation", focuses on technology for syntactic parsing. The chapter first briefly explains the concept of syntactic parsing and two grammar formalisms that are commonly adopted in syntactic parsers. It then illustrates the usage of a number of state-of-the-art syntactic parsers.

Chapter 6, "Syntactic Analysis", introduces tools that can be used to effectively query syntactically annotated corpora to extract sentences that contain the structures of interest to the researcher. It also reviews the measures of syntactic complexity commonly used in first and second language development research and details the usage of a number of tools that can be used to automatically analyze the syntactic complexity of first and second language samples.

Chapter 7, "Semantic, Pragmatic and Discourse Analysis", focuses on computational tools for automating or assisting the annotation and analysis of texts at the semantic, pragmatic and discourse levels. A number of tools for the analysis of semantic fields, propositions, conversational acts, coherence and cohesion, and text structure are examined.

Chapter 8, "Summary and Outlook", summarizes the range of computational tools for corpus annotation and analysis covered in the book and concludes the book with a discussion of future directions in computational corpus analysis, focusing in particular on the analysis of language meaning and use, learner language analysis, and analysis based on specific theories of language.
Chapter 2
Text Processing with the Command Line Interface
X. Lu, Computational Methods for Corpus Annotation and Analysis,
DOI 10.1007/978-94-017-8645-4_2, © Springer Science+Business Media Dordrecht 2014
Abstract This chapter aims to help demystify the command line interface that is commonly used in UNIX and UNIX-like systems such as Linux and Mac OS X for language and linguistics researchers with little or no prior experience with it, and to illustrate how it can be used for managing the file system and, more importantly, for text processing. Whereas most linguists are used to and comfortable with the graphic user interface, the command line interface does provide us with access to a wide range of computational tools for corpus processing, annotation, and analysis that may not be readily accessible through the graphic user interface. The specific command line interface used for illustration purposes in this chapter is the Terminal in Mac OS X, but the examples work in largely similar ways in the command line interface in a UNIX or Linux system.
2.1 The Command Line Interface
If you have only used the graphic user interface in a Windows-based PC or a Mac OS X to meet your computing needs, but have never or rarely used the Command Prompt in a Windows-based PC, the Terminal in a Mac OS X, or a computer with a UNIX or Linux operating system, you probably think of the command line interface as something that is useful only for geeky scientists and engineers. However, once in a while, you may have encountered one or more text processing tools or corpus annotation and analysis programs that do not have a graphic user interface version but rather can only be invoked from the command line in a UNIX or UNIX-like system (e.g., Linux and Mac OS X), and you may have given up on them with a shake of your head. Although at first look the command line interface may not be as user-friendly and intuitive as the graphic user interface, once you have learned the basics of how it works, you will find it a versatile and powerful way of interacting with the computer. More importantly, the command line interface enables us to access a large set of useful corpus processing, annotation and analysis tools that are not conveniently available via the graphic user interface.
In this chapter, we will illustrate the use of the command line interface, beginning with a set of basic commands that are necessary for navigating the file system and then focusing on several useful tools for text processing. Additional commands will be introduced in the following chapters as necessary. The specific command line interface we will use throughout this book is the Terminal in Mac OS X, but the commands and tools covered here and in the rest of the book will work in largely similar ways in the command line interfaces in UNIX and Linux, and you should be able to follow the discussion in the book in a UNIX or Linux operating system without noticing major differences. For a complete introduction to the command line interfaces in Linux and UNIX, see Robbins (2005), Siever et al. (2009), Shotts (2012), or other similar volumes.
If you are not sure how to open the Terminal in Mac OS X, you can do this in one of the following two ways.
1. Navigate to /Applications/Utilities (i.e., first navigate to the Applications directory, then to the Utilities directory within the Applications directory), and double click on "Terminal".
2. Click on the Spotlight icon in the menu bar (shown in the upper right corner in Fig. 2.1), type "Terminal" in the Spotlight box, and click on Terminal (listed after Top Hit and Applications in Fig. 2.1).
When the Terminal is opened, you will see a window similar to (but perhaps not exactly the same as) the one shown in Fig. 2.2. At the moment, you do not need to be concerned with the title of the window, which shows the username (in this case xflu), the active process name bash, and the dimensions of the window 80 × 24, or the first line of the window, which shows the date, time, and terminal ID of the last login. The second line of the window has three parts, first the name of the computer (in this case LALSSPKMB002) followed by a colon, then the name of the working directory (in this case ~, which is short for the home directory) followed by a white space, and finally the command prompt (in this case xflu$). Any command you type will appear immediately after the command prompt. In the next two sections, we will first introduce a set of basic commands for navigating the file system and then several tools that are useful for text processing.

Fig. 2.1 Locating the Terminal in Mac OS X via Spotlight

Fig. 2.2 The Terminal in Mac OS X

2.2 Basic Commands
2.2.1 Notational Conventions
Throughout the book, we will use the courier font to differentiate commands, filenames, and directory names from regular text. URL addresses mentioned in the main text will be enclosed in angle brackets. The actual commands to be entered in the Terminal will be given in blocks of code, as illustrated in the examples below, where $ denotes a command prompt (do not type it) and ¶ denotes a line break (an instruction for you to press ENTER). The actual command prompt in your own Terminal will look different (as illustrated by the second line in the Terminal in Fig. 2.2), but that difference is irrelevant here. Lines in the blocks of code that do not begin with $ and end with ¶ indicate output generated by a command, and they should not be typed or entered into the Terminal. In the case of a long command that runs two or more lines (see Sect. 2.3.6 for examples), use the command-ending ¶ to determine where the command ends. You should type a multi-line command continuously (with a white space instead of a line break between lines) and press ENTER only once at the end of the command (i.e., when you reach ¶).
In the first example below, the echo command is used to simply print anything you type after it on the screen. It is crucial that you type all commands exactly as they are provided, as typos as well as missing or extraneous elements (e.g., white space, single or double quotes, etc.) will likely lead to either error messages or unintended results. This is illustrated in the second example below, where a white space is missing between "echo" and "this".
$ echo this is going to be fun¶
this is going to be fun
$ echothis is going to be fun¶
-bash: echothis: command not found
2.2.2 Printing the Current Working Directory
The file system is hierarchically organized, and it may be easy to lose track of where you are in the hierarchy. The pwd command can be used to print the location of the current working directory, as illustrated in the example below. The output shows that my current working directory is /Users/xflu, i.e., a subdirectory called xflu under the Users directory, which is also my home directory. When you first open the Terminal, you are by default located in your home directory (i.e., /Users/yourusername)1, which is the directory that contains your Desktop, Documents, and Downloads folders, among others. If you have difficulty conceptualizing where your home directory is actually located, try finding it using the Finder in your Mac (open the Finder, click on "Go" in the menu bar, and then click on "Home").

$ pwd¶
/Users/xflu
2.2.3 Listing Files and Subdirectories
The ls command can be used to list the contents of a directory, including files and subdirectories. As in the following example, type ls after the command prompt to list the contents of your current working directory. If you have not done anything else in the Terminal after opening it and typing the pwd command shown above, you should now see a list of subdirectories in your home directory, including Desktop, Documents, Downloads, and possibly a few others.

$ ls¶
Desktop Documents Downloads
2.2.4 Making New Directories
The mkdir command can be used to make a new directory. Let us make a new subdirectory in the home directory called corpus using the following example. We will be using the corpus directory throughout the rest of this chapter.

$ mkdir corpus¶

Now, try listing the contents of the current working directory again. You will see that a corpus directory is now shown in addition to the other directories that were shown previously.
$ ls¶
corpus Desktop Documents Downloads
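As an aside not used elsewhere in this chapter, mkdir also takes a -p option that creates any missing parent directories along the way, which is handy for setting up a nested directory in one step. A small sketch (the directory name corpus2 is arbitrary; rm -r, which removes a directory, is covered in Sect. 2.2.8):

```shell
# Create corpus2 and a files subdirectory inside it in one command;
# without -p, mkdir would complain that corpus2 does not exist yet
mkdir -p corpus2/files

# Remove the directories again to keep the home directory uncluttered
rm -r corpus2
```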
In naming directories and files, note the following general rules:
1. Names are case sensitive.
2. Avoid white space in a file name or a directory name. Use the underscore or dash instead to concatenate different parts of a name if necessary.
3. Avoid the following characters, because they have special meanings in commands: | ; , @ # $ ( ) < > ? / \ " ' ` ~ { } [ ] = + & ^ *

1 If you are using a UNIX or Linux system, the path to the home directory, specifically the part preceding the username, will look different.
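To see rule 2 in action, consider what happens when a name contains a white space. The names below are hypothetical, chosen only for illustration:

```shell
# mkdir treats the white space as a separator between two arguments,
# so this creates TWO directories, named "my" and "data"
mkdir my data

# An underscore keeps the name together, creating ONE directory as intended
mkdir my_data

# Clean up the three directories created above (rm -r is covered in Sect. 2.2.8)
rm -r my data my_data
```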
Trang 2313 2.2 Basic Commands
2.2.5 Changing Directory Locations
The cd command can be used to change directory locations. For example, you can use the following example to change the current working directory to the corpus directory you just created.

$ cd corpus¶

You can also change to a directory by specifying its full (or absolute) path, which starts from the root of the file system. For example, the following command can be used to change my current working directory to the programs directory I created under the corpus directory (replace xflu with the name of your own home directory, i.e., your username).

$ cd /Users/xflu/corpus/programs¶

You can also change to another directory by specifying a relative path to that directory. A relative path specifies the location of another directory relative to the current directory, and so it starts from the current directory rather than the root of the file system. In order to explain how relative paths work, we need to first introduce two important hidden files called "." and "..", respectively. These files are hidden in the sense that you normally do not see them when viewing the contents of a directory. The file represented by a single dot identifies the current working directory, whereas the file represented by double dots identifies the parent directory of the current working directory. A relative path begins with one of these two filenames instead of a "/". Assuming the programs directory is your current working directory, you can change your working directory to the files directory using the command below. In this command, the double dots take you to the corpus directory (i.e., the parent directory of the current working directory, which is programs), and /files then takes you to the files directory within the corpus directory.
$ cd ../files¶

Now that your current working directory is the files directory, try typing the first command below. This command will take you two levels up the directory hierarchy: The first double dots take you to the parent directory of the files directory, i.e., the corpus directory, and the second double dots then take you to the parent directory of the corpus directory, i.e., your home directory. You can verify whether this is the case with the pwd command, as shown in the second command below.

$ cd ../..¶
$ pwd¶
/Users/xflu

In practice, however, if you are trying to get to a child or grandchild directory of the current directory, it is not necessary to type ./ and you can start directly with the name of the child directory instead. Let us return to the home directory with the first command below (where ~ is shorthand for the home directory) and then get to the files directory with the second command.

$ cd ~¶
$ cd corpus/files¶
Remember, if at any point you are lost in the directory hierarchy, you can always identify your current working directory with the pwd command, check out the contents of the current working directory with the ls command, and, as a last resort, return to the home directory from wherever you are with the cd ~ command.
2.2.6 Creating and Editing Text Files with UTF-8 Encoding
In general, the text files that we will be working with will be in plain text format (saved with the ".txt" suffix) rather than Word or PDF documents (saved with the ".doc", ".docx", or ".pdf" suffix). It is also desirable that the plain text files (regardless of the language they are in) be saved with UTF-8 (short for Unicode Transformation Format 8-bit) encoding to ensure compatibility with the various tools we will be introducing later.
A character encoding system pairs each character in a given character repertoire (e.g., a letter in the English alphabet or a Chinese character) with a unique code (e.g., a sequence of numbers). While humans read characters the way they are written, computers store and process information as sequences of numbers. Character encoding systems serve as a means to "translate" characters in written form into codes that can be decoded by computer programs. There are many national and international character encoding standards, which differ in terms of the number and types of characters they can encode as well as the types of codes that the characters are translated into. Not all encoding systems have a large enough capacity (or code points) to encode all characters (consider the large number of symbols required in scientific and mathematic texts), and the same character is often represented using different codes in different systems. To ensure that a text can be displayed and processed correctly by a specific computer program, it is necessary to choose an encoding system for the text that covers all the characters in the text and that is compatible with the computer program. This is especially relevant when the text is in a language other than English. To get a sense of what happens when an inappropriate encoding system is used for a text, try the following:
1. Open a web page in Chinese, e.g., <http://www.nankai.edu.cn>, or French, e.g., <http://news.google.fr>, in any web browser.
2. Click on "View" in the menu bar of the browser; under "Character Encoding" (in Firefox), "Encoding" (e.g., in Chrome), or "Text Encoding" (in Safari), select an encoding system that is intuitively incompatible with the web page. For example, for the Chinese web page, select an encoding system that starts with "Western", such as "Western (ISO-8859-15)" or "Western (ISO Latin 1)"; for the French web page, select an encoding system for Chinese, such as "Simplified Chinese (GBK)" or "Simplified Chinese (GB2312)".
3. You will see that many characters on the web pages will be displayed incorrectly.

The Unicode Standard solves the problems introduced by the existence of multiple encoding systems by assigning unique codes to characters in all modern languages and all commonly used symbols. There are seven character encoding schemes in Unicode, among which UTF-8 is the de facto standard for encoding Unicode on UNIX-based operating systems; it is also the preferred encoding for multilingual web pages, programming languages, and software applications. As such, it is desirable to save texts, particularly non-English texts, with the UTF-8 encoding. For further information about UTF-8 encoding or the Unicode Standard (including its other six encoding schemes) in general, consult the Unicode Consortium webpage.2
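If you are curious, you can watch this encoding at work from the command line. The sketch below assumes the xxd utility is available (it ships with Mac OS X and most Linux distributions) and that your Terminal uses UTF-8:

```shell
# The ASCII letter "e" is stored as a single byte (hexadecimal 65) in UTF-8
printf 'e' | xxd -p

# The accented letter "é" is stored as the two-byte sequence c3 a9,
# illustrating UTF-8's variable-width design
printf 'é' | xxd -p
```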
We will not look at how to create or edit plain text files through the command line interface, as in Mac OS X this can be done easily in a text editor that you are already familiar with, such as Microsoft Word or TextEdit. Let us now create a simple text file with the name myfile.txt and save it to the files folder with UTF-8 encoding. Make sure the file contains the following two lines only (press ENTER once at the end of each line), with no extra empty lines before or after them. Note that any formatting of the text (e.g., highlighting, italicizing, bolding, underlining, etc.) will not be saved in the plain text file. If this sounds trivial to you, you can do this directly on your own and skip the next two paragraphs.

2 <http://www.unicode.org>
Trang 2616 2 Text Processing with the Command Line Interface
This is a sample file
This is all very simple
To generate this file using Microsoft Word, open a new file in Microsoft Word, type the two English sentences mentioned above, and then save the file in the following steps.
1. Click on "File" in the menu bar and then click on "Save As…".
2. Enter myfile as the filename in the "Save As:" box and choose "Plain Text (.txt)" for the "Format:" box.
3. Locate the files folder (under the corpus subdirectory in your home directory) and click on "Save".
4. At this point, a "File Conversion" dialog box will pop up (see Fig. 2.3). Click on "Other encoding" and then choose "Unicode 6.0 UTF-8". Choose "CR/LF" for "End line with:". Click on "OK".
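As an alternative sketch for readers who would rather stay in the Terminal, the same file can also be created directly from the command line with the printf command (here \n stands for a line break and > saves the output to the named file; this assumes your current working directory is the files folder and that the Terminal uses UTF-8, the default in Mac OS X):

```shell
# Create myfile.txt containing the two sample lines
printf 'This is a sample file\nThis is all very simple\n' > myfile.txt

# Display the file to verify its content
more myfile.txt
```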
TextEdit can be used for the same purpose in a similar fashion. To open TextEdit, type "TextEdit" in the Spotlight box and then click on "TextEdit", similar to how you opened the Terminal (see Fig. 2.1). Now type the two English sentences mentioned above in the editor. To save the file in plain text format with the name myfile.txt in the files folder, follow the following steps:
1. Click on "Format" in the menu bar and then click on "Make Plain Text".
2. Click on "File" in the menu bar and then click on "Save".
3. Enter myfile in the "Save As:" box and choose "Unicode (UTF-8)" for the "Plain Text Encoding:" box.
4. Locate the files folder (under the corpus subdirectory in your home directory), and click on "Save".

2.2.7 Viewing, Renaming, Moving, Copying, and Removing Files
In this section, we will learn a set of commands that can be used to view, rename, copy, delete, and move files. Whereas you can perform these tasks easily with the graphic user interface, you will sometimes find it more efficient to get them done via the command line interface, especially when you are dealing with a large number of files or if you are already working on some files via the command line.
Before we start, first make sure that you have created the file myfile.txt and saved it to the files folder following the instructions in Sect. 2.2.6. Next, go to <http://tinyurl.com/corpusmethods> (hosted on Google Drive) and download the following three files: mylist.txt, mypoem.txt, and speech.txt to the files folder. We will be using these files for illustration purposes throughout the rest of this chapter. The file mylist.txt contains part-of-speech and frequency information for the 3,000 most frequent unlemmatized words in the British National Corpus (BNC). Each row in the file contains three tab-delimited columns or fields: a word (in lowercase), a tag indicating its part-of-speech category, and its frequency in the BNC.3 The part-of-speech tags will be discussed in detail in Chap. 3. The file mypoem.txt contains a short poem "Men Improve with the Years" by the Irish poet William Butler Yeats. Finally, the file speech.txt contains the transcript of the speech "I Have a Dream" delivered by Martin Luther King, Jr. on August 28, 1963.
If for any reason your current working directory is no longer files, change it back to files using the first command below, and then use the second command to verify that it contains the following four files: myfile.txt, mylist.txt, mypoem.txt, and speech.txt.
$ cd ~/corpus/files¶
$ ls¶
myfile.txt mylist.txt mypoem.txt speech.txt
The more command can be used to display the content of a text file on the screen. Use the first example below to view the content of myfile.txt. Since the text is short, the command prompt will be displayed in the next line immediately following the end of the text. Use the second example below to view the content of mylist.txt (the output of the command is omitted here). As the text has 3,000 lines and is longer than the remaining space in the Terminal, only the first screen is shown. You can press the SPACE bar on the keyboard to continue to the next screen or press Q on the keyboard to exit the file and return to the command prompt.
$ more myfile.txt¶
This is a sample file
This is all very simple
$ more mylist.txt¶
If you want to know the size of a file, you can use the wc command to display the number of lines, words, and characters in it. The first example below shows that myfile.txt has 2 lines, 10 words (as delimited by white space), and 46 characters (including white spaces and line breaks).

$ wc myfile.txt¶
2 10 46 myfile.txt
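Although we will mostly use plain wc in this book, the command also takes options that report just one of the counts; the sketch below assumes the standard -l (lines) and -w (words) options, which you can confirm with man wc on your system:

```shell
# Print only the number of lines in myfile.txt
wc -l myfile.txt

# Print only the number of words in myfile.txt
wc -w myfile.txt
```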
In the case of a long file, sometimes you may wish to view only the first or last few lines, instead of the whole file. The head and tail commands can be used for these purposes. The first example below shows the first 10 lines (by default, including empty lines) of mypoem.txt. You can also specify the exact number of lines you wish to view from the top with a command line option, in this case a dash followed by a number. This is illustrated in the second example below, which shows the first 5 lines of mypoem.txt.
$ head mypoem.txt¶
Men improve with the Years
by W. B. Yeats (1865-1939)

I am worn out with dreams;
A weather-worn, marble triton
Among the streams;
And all day long I look
Upon this lady's beauty
As though I had found in book
A pictured beauty,

3 This file was adapted from the file all.num.o5 made publicly available by Adam Kilgarriff at <http://www.kilgarriff.co.uk/BNClists/all.num.o5>.
$ head -5 mypoem.txt¶
Men improve with the Years
by W. B. Yeats (1865-1939)

I am worn out with dreams;
A weather-worn, marble triton
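Note that on some systems the -5 form of the option is considered historical; the equivalent and more portable form spells the line count out with -n (check man head on your system if one form does not work):

```shell
# Equivalent to head -5: show the first 5 lines of mypoem.txt
head -n 5 mypoem.txt
```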
The tail command can be used to view a file from the bottom. The first example below shows the last 10 lines (by default, including empty lines) of mypoem.txt. You can again specify the exact number of lines you wish to view from the bottom with a command line option, as illustrated in the second example below, which displays the last 5 lines of mypoem.txt. We have omitted the output of these two examples here.

$ tail mypoem.txt¶
$ tail -5 mypoem.txt¶
The mv command can be used to rename a file. Use the first example below to rename myfile.txt as myfile2.txt. The first filename after mv is the original filename, and the second one is the new filename. The second example below shows that the filename has been successfully changed.
$ mv myfile.txt myfile2.txt¶
$ ls¶
myfile2.txt mylist.txt mypoem.txt speech.txt
The mv command can also be used to move a file to a different directory. For example, the following example can be used to move myfile2.txt to the programs folder you created earlier under the corpus directory. In this case, after typing mv, we need to first specify the name of the file to be moved, which is myfile2.txt, and then specify the path to the target directory, which is ../programs/ (recall the discussion on relative paths above).

$ mv myfile2.txt ../programs/¶
You can also move a file and give it a new name simultaneously. Use the following example to move myfile2.txt from the programs folder back to the files folder and change its name back to myfile.txt. Here, after typing mv, we first specify the location and name of the file to be moved (which is ../programs/myfile2.txt), and then specify the target location as well as the new name of the file. Since the target location is the current working directory (i.e., the files directory), we only need to specify the new filename (i.e., myfile.txt).

$ mv ../programs/myfile2.txt myfile.txt¶
The cp command can be used to make a copy of a file, either in the same directory as the original file or in a different directory. Use the first example below to make a copy of myfile.txt with the name myfile2.txt in the same directory. Then use the second example below to make a copy of myfile.txt with the name myfile2.txt in the programs folder. Finally, use the third example to verify that the programs folder contains the file myfile2.txt.

$ cp myfile.txt myfile2.txt¶
$ cp myfile.txt ../programs/myfile2.txt¶
$ ls ../programs/¶
myfile2.txt

The rm command can be used to remove a file. As we will not need the two copies of myfile.txt any further, use the following two examples to remove myfile2.txt from the files folder and from the programs folder, respectively.

$ rm myfile2.txt¶
$ rm ../programs/myfile2.txt¶
2.2.8 Copying, Moving, and Removing Directories
The cp, mv, and rm commands can also be used to copy, move and remove directories. In the following examples, the first command creates a subdirectory temp within the current working directory (which should still be files). The second command makes a copy of the temp directory in the programs directory, where -r is a command line option that allows cp to copy a directory. The third command removes the temp directory from the programs directory, where -r likewise allows rm to remove a directory. The fourth command moves the temp directory from the files directory to the programs directory. The last command removes the temp directory from the programs directory again.

$ mkdir temp¶
$ cp -r temp ../programs/¶
$ rm -r ../programs/temp¶
$ mv temp ../programs/¶
$ rm -r ../programs/temp¶
2.2.9 Using Shell Meta-Characters for File Matching
A group of characters have special meanings when used in commands in the command line interface. These characters are sometimes called shell meta-characters. In this section, we will introduce two shell meta-characters that can be used to specify all the files or directories whose names match a certain pattern.
The character "*" can be used to match any group of characters of any length, and the character "?" can be used to match a single occurrence of any single character. To illustrate the functions of these two characters, let us first make two copies of the file myfile.txt and save them as myfile2.txt and myfile3.txt (see the first two commands below). The third command below displays the contents of all the files that start with "my"; the fourth one displays all the files with the ".txt" suffix; the fifth one displays all the files in the current working directory; the sixth one displays all the files that start with "myfile", followed by any single character, followed by ".txt" (in this case myfile2.txt and myfile3.txt); and the last command removes the same set of files displayed by the sixth command (i.e., myfile2.txt and myfile3.txt).

$ cp myfile.txt myfile2.txt¶
$ cp myfile.txt myfile3.txt¶
$ more my*¶
$ more *.txt¶
$ more *¶
$ more myfile?.txt¶
$ rm myfile?.txt¶

If you would like to learn more about a command, including the full set of command line options it takes, you can use the man command to view its manual page, as in the two examples below (press Q to exit a manual page).

$ man ls¶
$ man rm¶
We will conclude this section with a brief mention of two useful "tricks" that can be used to help reduce the amount of typing (and minimize typing errors). The first trick is what is known as command history. If you press the up or down arrow on the keyboard, you will be able to move up or down the list of commands you recently entered to rerun, modify or just examine a command. The second trick is what is known as command line completion, which allows you to type the first few characters of a filename, directory name, or command name and press the "Tab" key on the keyboard (known as the completion key) to automatically fill in the rest of the name. For example, if your current working directory is ~/corpus/, you can use the following command to view the content of speech.txt in the files directory. When typing the command, you can press the "Tab" key after entering the letter "f" to have the rest of the directory name files automatically filled in, and then again after you enter the letter "s" to have the rest of the filename speech.txt automatically filled in. Note, however, that if the names of two or more commands, files, or directories start with the same few characters, you need to type enough characters for the system to disambiguate what command, file, or directory you are trying to type. For example, to get to the file myfile.txt, it is necessary to type "myf" before pressing the "Tab" key, as there are other files whose names start with "my" in the files directory.

$ more files/speech.txt¶
2.3 Tools for Text Processing
In this section, we will introduce a set of useful commands and tools for text processing, including the egrep command for searching for a string in one or more text files, the tr command for translating one set of characters in a text file into another set of characters, the sed command for editing a text file directly from the command line, and the awk command for filtering and manipulating data organized as records and fields (e.g., rows and columns). In introducing these commands, we will also examine the basic usage of regular expressions and touch upon the sort command for sorting lines in a text file and the uniq command for removing repetitive lines from a text file.
It is important to note that, for all the examples in this section, it is assumed that your current working directory is the files directory. If you lost track of your location or have opened a new Terminal window, you can always use the following command to set your current working directory to files. Furthermore, if you skipped Sects. 2.2.6 and 2.2.7 above, please follow the instructions in those sections with respect to creating and downloading the text files that we will be using in this section.

$ cd ~/corpus/files¶
2.3.1 Searching for a String with egrep
The egrep command can be used to search for lines containing a literal string or a string that matches a specific pattern in one or more text files This is useful when you have a big file and need to extract all lines containing a particular string from that file or when you have a large number of files and need to determine which files contain a particular string Whereas you can also search for literal strings in
a document in text editors like Microsoft Word, egrep allows you to search for complex patterns in addition to literal strings, to search any number of text files at the same time, and to directly save the lines retrieved to a text file
The first example below searches myfile.txt for the string “simple”. As this example shows, following the command name egrep, you need to first specify the actual string you are searching for (in this case “simple”), enclosed in single quotes, and then the file you are searching (in this case myfile.txt). The output will include all lines in the file that contain the string being searched. The second example below illustrates how you can save the output to a file instead of printing it out on the screen. In this example, “>” is the character used to redirect the output to a file, and result.txt is the name of the new text file that the output will be redirected to. You can check the content of result.txt with the more command and then delete it with rm to keep the folder uncluttered.
$ egrep 'simple' myfile.txt¶
This is all very simple
$ egrep 'simple' myfile.txt > result.txt¶
$ more result.txt¶
This is all very simple
$ rm result.txt¶
It is also worth noting that when the string you are searching for contains a space, it must be enclosed in single quotes, as in the first example below. As the second example shows, if the quotes are omitted, egrep interprets “very” as the search string and “simple” as the name of an additional file to search, which results in the error message below.
$ egrep 'very simple' myfile.txt¶
This is all very simple
$ egrep very simple myfile.txt¶
egrep: simple: No such file or directory
myfile.txt:This is all very simple
It is important to note that, by default, the string you search for can appear anywhere in a line in the text and does not need to be a word or phrase on its own. If you want to specify that the string you are searching for should be a word or phrase of its own, you need to enclose the string in “\<” and “\>”. For example, the first command below returns both lines in myfile.txt, because they both contain the letter “a” somewhere. The second command, however, returns only the first line in myfile.txt, as the second line does not contain the word “a”.
$ egrep 'a' myfile.txt¶
This is a sample file
This is all very simple
$ egrep '\<a\>' myfile.txt¶
This is a sample file
A highly useful command line option for egrep is the -v option, which can be used to search for lines that do not contain the pattern specified. The example below searches for lines that do not contain the string “simple” in myfile.txt.
$ egrep -v 'simple' myfile.txt¶
This is a sample file
If you want to search for a string in a few files at the same time, you can list the filenames one by one after the search string. You can certainly also use the shell meta-character “*” to match all files in the current working directory. The first example below searches for lines containing the word “all” in myfile.txt and mypoem.txt simultaneously. The second example below searches for the same lines in all text files in the directory.
$ egrep '\<all\>' myfile.txt mypoem.txt¶
myfile.txt:This is all very simple
mypoem.txt:And all day long I look
$ egrep '\<all\>' *.txt¶
myfile.txt:This is all very simple
mypoem.txt:And all day long I look
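When the goal is to find out which files contain a match rather than to see the matching lines themselves, the -l option of egrep prints only the names of the files that contain at least one match. The sketch below is self-contained: it creates two throwaway files (the names f1.txt and f2.txt are arbitrary) instead of using the files directory.

```shell
# Create two small throwaway files.
printf 'This is all very simple\n' > f1.txt
printf 'nothing to see here\n' > f2.txt

# -l lists only the files that contain a match for the pattern.
egrep -l '\<all\>' f1.txt f2.txt

# Clean up the throwaway files.
rm f1.txt f2.txt
```

Only f1.txt is printed, since f2.txt contains no line with the word “all”.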
2.3.2 Regular Expressions
From time to time, you may need to search for lines of text containing a specific pattern (e.g., a three-letter word that starts with “b” and ends with “d” with any vowel letter in between) instead of a literal string (i.e., a specific word, phrase, or string of characters). This is where regular expressions become useful. Regular expressions are sequences of characters that specify patterns, and they can be used in UNIX tools such as egrep to search for patterns in text, to replace strings that match specified patterns with something else, as well as to manipulate strings in text in many other useful ways. In this section, we will briefly introduce the basic usage of two types of regular expressions that can be used with various UNIX tools in the command line, i.e., basic and extended regular expressions. For a comprehensive introduction to regular expressions, including the types of regular expressions used in scripting languages such as Perl and Python, see Friedl (2006).
Regular expressions bear some resemblance to the file matching patterns briefly discussed in Sect. 2.2.9 above, in that they also generally contain a combination of literal and special characters. However, it is important to keep in mind that while some special characters used in regular expressions act in the same way as shell meta-characters, some do act differently (e.g., the “*” character). In addition, regular expressions are always quoted, while file matching patterns are not.
A crucial component of learning how regular expressions work lies in learning the meanings and functions of special characters. While basic and extended regular expressions make use of the same set of special characters, a key difference between them is that in basic regular expressions, the special characters ‘?’, ‘+’, ‘{’, ‘|’, ‘(’ and ‘)’ lose their special meanings, and a backslash needs to be used prior to them for them to function. In addition, the two types of regular expressions are also compatible with different types of UNIX tools. In egrep, for example, extended regular expressions are used. In the rest of this chapter, we will illustrate the usage of extended regular expressions with egrep. The usage of basic regular expressions, which is largely similar, will be illustrated when we discuss sed in Sect. 2.3.4.
Positional Anchors Positional anchors are characters that can be used to specify the position of the expression to be matched in a line of text. The most commonly used positional anchors are the caret ‘^’ and the dollar sign ‘$’, which are placed at the beginning or end of a regular expression to match the beginning or end of a line, respectively. The following two examples search for lines that start with “But” and end with “eyes” in mypoem.txt, respectively.
$ egrep '^But' mypoem.txt¶
But I grow old among dreams,
$ egrep 'eyes$' mypoem.txt¶
Pleased to have filled the eyes
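The two anchors can also be combined: a pattern of the form “^…$” matches only lines that consist of exactly the anchored expression. The sketch below is self-contained, feeding egrep two made-up lines with printf instead of using mypoem.txt.

```shell
# Only the second line is exactly "But", so '^But$' returns one line;
# '^But' alone matches both lines, since both begin with "But".
printf 'But I grow old among dreams,\nBut\n' | egrep '^But$'
printf 'But I grow old among dreams,\nBut\n' | egrep -c '^But'
```

The first command prints only “But”; the second, using the -c option to count matching lines, prints 2.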
Wildcards and Repetition Operators In regular expressions, the period can be used as a wildcard to match any single character. For example, “s.mple” matches a string with any one character in between “s” and “mple”, such as “simple” and “sample” (but also “sumple”, “s2mple”, etc., although they do not exist in myfile.txt), as illustrated in the example below.
$ egrep '\<s.mple\>' myfile.txt¶
This is a sample file
This is all very simple
There are three basic repetition operators in regular expressions, namely, the asterisk “*”, the question mark “?”, and the plus sign “+”. The asterisk “*” is used to match zero or more occurrences of the previous character. The question mark “?” is used to match zero or one occurrence of the previous character, effectively making that character optional. The plus sign “+” is used to match one or more occurrences of the previous character. The first example below searches for the words “thing” and “things” in mylist.txt; the second example searches for “s” followed by one or more occurrences of “o” (i.e., “so”, “soo”, “sooo”, etc.) in myfile.txt; the third example searches for “s” followed by any number of occurrences of “o” (i.e., “s”, “so”, “soo”, “sooo”, etc.) in myfile.txt. In the last example, the “.*” combination specifies zero or more occurrences of any character, and the expression therefore matches lines that start with “This” with “all” appearing somewhere later.
$ egrep '\<things?\>' mylist.txt¶
things nn2 42409
thing nn1 35203
$ egrep '\<so+\>' myfile.txt¶
$ egrep '\<so*\>' myfile.txt¶
$ egrep '^This.*all' myfile.txt¶
This is all very simple
It is also possible to specify the exact number of times a character is repeated. This can be done using “{N}”, “{N,M}”, or “{N,}”, in which “N” and “M” are positive whole numbers and N is smaller than M. These match exactly N occurrences, N to M occurrences, or N or more occurrences of the previous character, respectively. For example, the pattern “b.{4}h” matches a string that starts with “b” and ends with “h” with exactly four characters in between, and “b.{4,}h” matches one with four or more characters in between. The example below searches mylist.txt for a word that starts with “b” and ends with “c” with five characters in between. Note that the result may not look like a word at all. This is so because the character represented by “.” could be any character, including white spaces and tabs. In this particular example, the sequence “but cjc” is considered a match because it starts with “b”, ends with “c”, and has five characters (one of which is a tab) in between. To avoid this type of result, it is necessary to specify that the character in between cannot be a white space character, or that it must be a lowercase letter. Both can be done using character classes, which we turn to below.
$ egrep '\<b.{5}c\>' mylist.txt¶
but cjc 454096
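The difference between the bounded forms can be seen on a small made-up sample piped into egrep with printf (no files needed):

```shell
# '^lo{2}k$' requires exactly two "o"s, so only "look" matches.
printf 'lok\nlook\nloook\n' | egrep '^lo{2}k$'

# '^lo{2,}k$' requires two or more "o"s, so "look" and "loook" both match.
printf 'lok\nlook\nloook\n' | egrep -c '^lo{2,}k$'
```

The first command prints “look”; the second prints 2, the number of matching lines.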
Character Classes In regular expressions, character classes can be used to match a character from a specific list or within a specific range. A number of character classes have been predefined with the general format [:CLASS:], in which CLASS denotes a specific class name. Some commonly used character classes are given below:
[:alnum:] Any alphanumeric character (i.e., 0 to 9, A to Z, and a to z)
[:alpha:] Any alpha character A to Z or a to z
[:digit:] Any digit 0 to 9
[:lower:] Any lowercase alpha character a to z
[:upper:] Any uppercase alpha character A to Z
[:punct:] Any punctuation symbol
[:space:] Any whitespace character
To use a predefined character class in egrep, it is necessary to enclose it in an extra pair of square brackets. The first example below now attempts to match lines containing a word that begins with “b” and ends with “c”, with five lowercase letters in between, in mylist.txt (no match is found). The second example matches lines that end with the letter “s” followed by a punctuation mark in mypoem.txt.
$ egrep '\<b[[:lower:]]{5}c\>' mylist.txt¶
$ egrep 's[[:punct:]]$' mypoem.txt¶
I am worn out with dreams;
Among the streams;
Or the discerning ears,
For men improve with the years;
But I grow old among dreams,
Among the streams.
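You can also try out the predefined classes without any files, piping a couple of invented sample lines straight into egrep with printf:

```shell
# [[:digit:]] matches lines containing at least one digit.
printf 'room 101\nno number here\n' | egrep '[[:digit:]]'

# '^[[:upper:]]' matches lines beginning with an uppercase letter.
printf 'Among the streams\nand all day long\n' | egrep '^[[:upper:]]'
```

The first command prints “room 101”; the second prints “Among the streams”.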
In addition to predefined character classes, you can also use the square brackets “[” and “]” to specify your own character class. The first example below searches for lines containing a three-letter string that starts with “b” and ends with “t” with any lowercase vowel letter in between (i.e., “bat”, “bet”, “bit”, “bot”, and “but”). Within the brackets, the dash “-” can be used between two characters to indicate a range, and the caret “^” can be used to complement the list or range of characters listed in the brackets. The second example below searches for lines containing a numeric character in mypoem.txt, and the third example searches for lines that do not start with an uppercase letter in mypoem.txt. In the third example, the first “^” matches the beginning of a line, whereas the “^” in the square brackets complements the range of characters specified by “A-Z” (i.e., all capital letters). Finally, the last example searches for lines that start with a word consisting of the letter “a” followed by any number of consonant letters in mylist.txt (the first five lines of the output are shown).
$ egrep '\<b[aeiou]t\>' mylist.txt¶
$ egrep '^a[b-df-hj-np-tv-z]*\>' mylist.txt¶
a at0 2126369
at prp 478162
an at0 343063
all dt0 227737
Grouping and Alternation Operators The parentheses “(” and “)” can be used as grouping operators to group multiple characters into a single entity. The example below matches lines that begin with the word “the” or “there” in mylist.txt. In this example, the “?” operates on the string “re”, which is grouped into a single entity by the parentheses.
$ egrep '^the(re)?\>' mylist.txt¶
The vertical bar “|” serves as the alternation operator: a pattern of the form “X|Y” matches either the expression “X” or the expression “Y”. The example below searches for lines containing either “their” or “there” in mylist.txt.
$ egrep 'their|there' mylist.txt¶
A final special character that is very useful in regular expressions is the backslash “\”, which is used as the escape character. Whenever a special character is part of a literal string you want to search for, you can precede that character with the backslash, which allows the special character to be “escaped” from its special meaning and interpreted as a literal character. The example below searches for lines ending with the string “streams” followed by a period, which is escaped. Without the “\” before the period, the period would function as a special character that denotes any single character.
$ egrep 'streams\.$' mypoem.txt¶
Among the streams.
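The effect of the escape can be checked directly by running the escaped and unescaped versions of the pattern on the same two invented lines:

```shell
# Escaped: '\.' matches only a literal period, so exactly one line matches.
printf 'streams.\nstreamsX\n' | egrep -c 'streams\.$'

# Unescaped: '.' matches any single character, so both lines match.
printf 'streams.\nstreamsX\n' | egrep -c 'streams.$'
```

The first count is 1; the second is 2.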
2.3.3 Character Translation with tr
The tr command can be used to translate one set of characters in a text file into another set of characters, or to delete one set of characters altogether. The first example below illustrates its general usage. There are seven elements in this command: the command name tr, the source characters (i.e., the set of characters to be translated, in this case “a”), the target characters (i.e., the set of characters the source characters are to be translated into, in this case “e”), the “<” character (used to redirect the input from the file to the command), the name of the input file (myfile.txt), the “>” character (used to redirect the output of the command to a file), and the name of the output file (myfile2.txt). If you use the second example below to view the content of myfile2.txt, you will see that it is the same as that of myfile.txt, except that the character “a” has been replaced by “e” everywhere.
$ tr 'a' 'e' < myfile.txt > myfile2.txt¶
$ more myfile2.txt¶
This is e semple file
This is ell very simple
The source and target characters can also be specified using character classes. The following two examples both translate all uppercase letters to lowercase letters.
$ tr '[A-Z]' '[a-z]' < myfile.txt¶
this is a sample file
this is all very simple
$ tr '[:upper:]' '[:lower:]' < myfile.txt¶
this is a sample file
this is all very simple
The special characters “\n” and “\t” can be used to refer to line breaks and tabs, respectively. The first example below translates each white space into a tab, and the second example translates each white space into a line break.
$ tr ' ' '\t' < myfile.txt¶
This	is	a	sample	file
This	is	all	very	simple
$ tr ' ' '\n' < myfile.txt¶
This
is
a
sample
file
This
is
all
very
simple
Several command line options extend the basic behavior of tr. The -c option complements the set of source characters (i.e., it selects every character that is not in the set), the -d option deletes the source characters instead of translating them, and the -s option squeezes each run of repeated source characters into a single occurrence. These options can also be combined. The first example below translates every character that is not alphanumeric into a slash; the second deletes all punctuation marks; the third squeezes runs of white space characters into one; and the last uses -cd to delete every character that is not a vowel letter.
$ tr -c '[:alnum:]' '/' < myfile.txt¶
This/is/a/sample/file//This/is/all/very/simple//
$ tr -d '[:punct:]' < myfile.txt¶
This is a sample file
This is all very simple
$ tr -s '[:space:]' < myfile.txt¶
This is a sample file
This is all very simple
$ tr -cd '[aeiouAEIOU]' < myfile.txt¶
iiaaeieiiaeie
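Because tr reads standard input and writes standard output, it combines naturally with other commands in a pipeline. For example, the -cd example above can be extended with wc -c to count the vowels instead of printing them. The sketch below feeds in the sample text with printf so that it is self-contained (note that wc may pad its output with leading spaces on some systems):

```shell
# Keep only the vowel letters, then count the remaining bytes.
printf 'This is a sample file\nThis is all very simple\n' |
  tr -cd 'aeiouAEIOU' | wc -c
```

The count is 13, one for each of the thirteen characters of “iiaaeieiiaeie” above.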
2.3.4 Editing Files from the Command Line with sed
The sed command can be used to modify files automatically from the command line. We will only cover the basic usage of the most essential function of sed here, i.e., the substitution function. Whereas tr translates a set of characters into another set of characters, sed can be used to substitute strings (rather than just characters) matching a pattern with another string. Let us begin with a simple example that substitutes the string “This” in myfile.txt with the string “THIS”.
$ sed 's/This/THIS/g' myfile.txt > myfile2.txt¶
$ more myfile2.txt¶
THIS is a sample file
THIS is all very simple
As we are already familiar with input and output redirection, let us focus on the part enclosed in single quotes (i.e., 's/This/THIS/g') in the command above. There are four components in this part. As you probably have guessed, the “/” is used as a delimiter to separate the four components. The first component is the letter