PORTABLE LANGUAGE TECHNOLOGY:
A RESOURCE-LIGHT APPROACH TO MORPHO-SYNTACTIC TAGGING
DISSERTATION
Presented in Partial Fulfillment of the Requirements for
the Degree Doctor of Philosophy in the Graduate School of The Ohio State University
By Anna Feldman
*****
The Ohio State University
2006
Professor Christopher H. Brew, Advisor
Professor Brian D. Joseph
Professor W. Detmar Meurers
Advisor, Graduate Program in Linguistics
UMI Number: 3226393
UMI Microform 3226393
Copyright 2006 by ProQuest Information and Learning Company.
All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company
300 North Zeeb Road, P.O. Box 1346, Ann Arbor, MI 48106-1346
Copyright by Anna Feldman 2006
ABSTRACT
Morpho-syntactic tagging is the process of assigning part of speech (POS), case, number, gender, and other morphological information to each word in a corpus. Morpho-syntactic tagging is an important step in natural language processing. Corpora that have been morphologically tagged are very useful both for linguistic research, e.g. finding instances or frequencies of particular constructions in large corpora, and for further computational processing, such as syntactic parsing, speech recognition, stemming, and word-sense disambiguation, among others. Despite the importance of morphological tagging, there are many languages that lack annotated resources. This is almost inevitable because these resources are costly to create. But, as described in this thesis, it is possible to avoid this expense.
This thesis describes a method for transferring annotation from a morphologically annotated corpus of a source language to a corpus of a related target language. Unlike unsupervised approaches that do not require annotated data at all and, as a consequence, lack precision, the approach proposed in this dissertation relies on linguistic knowledge, but avoids large-scale grammar engineering. The approach needs neither a parallel corpus nor a bilingual lexicon, and requires much less linguistic labor than the standard technology.
This dissertation describes experiments with Russian, Czech, Polish, Spanish, Portuguese, and Catalan. However, the general method proposed can be applied to any fusional language.
To Batsheva Barenfeld, Mira Barenfeld, and Ilia Feldman who made me who I am, and
Gera and Naomi who like me this way.
ACKNOWLEDGMENTS
Even though my name is the only author on this work, many have contributed to its development and completion: those who provided insights, comments, and suggestions, and those who provided friendship, love, and support.
First I want to thank Chris Brew for (surprisingly easily) agreeing to take me as his advisee and for being a terrific advisor. Always with bright insights (mostly in the form of interrogation), always knowledgeable, always generous with his time, always with anecdotal stories and jokes, always with good advice: Chris has become an object of appreciation. It was his seminar on Corpora and Multilingual Verb Classification where I realized that Czech is rather useful for processing Russian verbs.
Special thanks go to my friend and colleague, Jirka Hana, who contributed an incredible amount of work and ideas to this project. He developed a resource-light portable morphological analyzer which became the basis for the cross-language system described in this thesis. This work started as a joint project and many ideas developed in this thesis were inspired by discussions with Jirka.
I also want to thank Detmar Meurers, another member of my dissertation committee. He was the first to introduce me to the field of Computational Linguistics and got me excited about it. Detmar gave me a lot of good advice and support over the years. He managed to keep me always in mind, pointing to the relevant literature and tools, and making me believe that I actually can write a dissertation! Throughout the years, I took several seminars with Detmar, and that’s where I acquired most of the skills for working on my thesis.
I also thank Brian Joseph for always extremely insightful comments and timely feedback. What can I say? Brian knows I am so lucky he agreed to be my committee member.
I also want to thank people who helped me with corpora used in the experiments: Sandra Maria Aluísio, Gemma Boleda, Toni Badia, Lukasz Debowski, Maria das Graças Volpe Nunes, Jan Hajic, Ricardo Hasegawa, Vicente López, Lluís Padró, Carlos Rodríguez Penagos, Adam Przepiórkowski, and Martí Quixal.
Special thanks go to Stacey Bailey for being such a nice office mate, for the hot chocolate with marshmallows, and for being ready to proofread the entire draft of this dissertation.
I would like to thank my parents for giving me the freedom of choice and always trusting and supporting me. Linguistics is definitely not a profession that runs in our family.
Then, there is a long list of people who deserve a word of thanks because of one or more of the following things: their teaching, their willingness to discuss whatever linguistic or non-linguistic topic, their collegiality, and their friendship. These are (in alphabetical order) Luiz Amaral (a friend and an expert in Romance languages), Mary Beckman, Ilana Bromberg, Donna Byron, Angelo Costanzo (for the Catalan-Spanish false cognates), Peter Culicover, Mike Daniels (another person who knows practically everything and is always ready to help), Eric Fosler-Lussier, Kordula De Kuthy, Markus Dickinson, Edit Doron, David Dowty, Stefan Dyła, Yakov Feldman (for making my life full of art), Zhenya Gabrilovich (There are computer scientists who can actually understand linguists. Well, at least to some extent), Anna Ghazaryan (a friend and a mathematician!), Jonathan Ginzburg, Jan Hajic, Hanka Hanova (swimming, hiking, cooking, reminding me that there is life beyond Oxley), Jirka Hana (a friend and colleague, whose contribution rates a second mention), Jim Harmon, Erhard Hinrichs (for always good advice, encouragement, and the subtaggers idea), Beth Hume, Martin Jansche, Dimitra Kolliakou, Greg Kondrak, Soyoung Kang, Chandana and Rupan Kundu (I wouldn’t have finished this thesis without you, guys!), Bob Levine (for making me believe I can do it and for making me want to know physics), Xiaofei Lu (my former office mate, my current lab mate, full of good ideas and jokes), Vahagn Manukian (for friendship, math, and grill), Arantxa Martin-Lozano (my dear friend), Dennis Mehay, Vanessa Metcalf (for the devoted friendship and mental support), Marcela Michalkova, Martin Michalek, Rick Nouwen (Utrecht, Utrecht), Carl Pollard (for making me love syntax, logic, and math), Craige Roberts, Anton Rytting (for being such a friendly office mate and for being always ready to discuss Arabic vowels, Lettish dialects, and entropy), Andrea Sims (an expert in Slavic languages), Shari Speer, Soundar Srinivasan, Maya Schwekher, Nathan Vaillette, Shravan Vasishth (strict, but fair), Pauline Welby (who always has some interesting story to tell), Don Winford, Mike White, Yael Ziv, and many, many other people. Thank you all!
Last but not least, I thank Gera, who asked me not to include his name here. So, read the dedication.
VITA
1997 B.A. English Language and Literature, B.A. East-Asian Studies, Hebrew University of Jerusalem, Israel
1997–1998 Research Assistant, Hebrew University of Jerusalem
1999 M.A. English Linguistics, Hebrew University of Jerusalem
1999–2000 Research Assistant, The Ohio State University
2001 Marie-Curie Fellow, Utrecht Institute of Linguistics, The Netherlands
2000–2005 Teaching Assistant, The Ohio State University
2005 M.A. Linguistics, The Ohio State University
2005–present Language Consultant, Zi Corporation, Canada
2005–2006 Presidential Fellow, The Ohio State University
PUBLICATIONS
1. Anna Feldman, Jirka Hana, and Chris Brew (2006). A Cross-language Approach to Rapid Creation of New Morpho-syntactically Annotated Resources. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), Genoa, Italy.
2. Jirka Hana, Anna Feldman, Luiz Amaral, and Chris Brew (2006). Tagging Portuguese with a Spanish Tagger Using Cognates. In Proceedings of the Workshop on Cross-language Knowledge Induction hosted in conjunction with the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Trento, Italy, pp. 33–40.
3. Anna Feldman, Jirka Hana, and Chris Brew (2006). Experiments in Morphological Annotation Transfer. In Proceedings of Computational Linguistics and Intelligent Text Processing (CICLing), A. Gelbukh (editor), Lecture Notes in Computer Science, Mexico City, Mexico, pp. 41–50. Springer-Verlag.
4. Anna Feldman, Jirka Hana, and Chris Brew (2005). Buy One, Get One Free or What to Do When Your Linguistic Resources are Limited. In Proceedings of the Third International Seminar on Computer Treatment of Slavic and East-European Languages (Slovko), Bratislava, Slovakia.
5. Jirka Hana, Anna Feldman, and Chris Brew (2004). A Resource-light Approach to Russian Morphology: Tagging Russian Using Czech Resources. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), Barcelona, Spain, pp. 222–229.
6. Jirka Hana and Anna Feldman (2004). Portable Language Technology: Russian via Czech. In Proceedings of the First Midwest Computational Linguistics Colloquium, Bloomington, Indiana.
7. Stefan Dyła and Anna Feldman (2003). On Comitative Constructions in Polish and Russian. In Proceedings of the Fifth European Conference on Formal Description of Slavic Languages, Leipzig, Germany.
8. Anna Feldman (2003). On S-Coordination and Plural Pronoun Constructions. In Balkan and Slavic Linguistics, vol. 2, ed. Daniel E. Collins and Andrea D. Sims, The Ohio State University, Columbus, Ohio, USA, pp. 49–75.
9. Anna Feldman (2002). Kim and Sandy, Kim with Sandy, Just Me or Both of Us? In Proceedings of European Summer School of Logic, Language, and Information (ESSLLI), Trento, Italy, pp. 41–52.
10. Anna Feldman (2002). On NP-coordination. The UiL OTS 2002 Yearbook, Utrecht, The Netherlands, pp. 39–66.
11. Anna Feldman (2001). Comitative and Plural Pronoun Constructions. In Proceedings of the 17th Annual Meeting of the Israel Association of Theoretical Linguistics (IATL), Jerusalem, Israel.
12. Anna Feldman (2000). Že: Codification of ’Hearer-Old’ Information. In Proceedings of the 27th Linguistic Association of Canada and the United States (LACUS) Forum, Houston, Texas, USA, pp. 187–202.
FIELDS OF STUDY
Major Field: Linguistics
Specialization: Computational Linguistics
TABLE OF CONTENTS
Page
Abstract ii
Dedication iii
Acknowledgments iv
Vita vii
List of Tables xvi
List of Figures xx
1 Introduction 1
1.1 Language technology and resource-poor languages 2
1.2 Morphological tagging 3
1.3 Reducing the annotation burden by cross-language knowledge induction 5
1.4 Related work 6
1.5 Dissertation structure 7
I Linguistic and Computational Foundations 9
2 Language properties 10
2.1 Slavic languages 10
2.1.1 Czech 12
2.1.2 Polish 14
2.1.3 Russian 16
2.1.4 Contrastive study 20
2.2 Romance languages 23
2.2.1 Catalan 26
2.2.2 Portuguese 32
2.2.3 Spanish 35
2.2.4 Contrastive study 38
2.3 Summary 41
3 Tagsets and corpora 43
3.1 Corpora 43
3.1.1 Slavic 43
3.1.1.1 Czech 43
3.1.1.2 Russian 44
3.1.1.3 Polish 45
3.1.2 Romance 46
3.1.2.1 Spanish 46
3.1.2.2 Portuguese 47
3.1.2.3 Catalan 49
3.2 Language population and language technology 49
3.3 Tagsets 50
3.3.1 Slavic tagsets 50
3.3.1.1 Czech 52
3.3.1.2 Russian 56
3.3.1.3 Polish 57
3.3.2 Romance tagsets 58
3.3.2.1 Spanish 58
3.3.2.2 Portuguese 59
3.3.2.3 Catalan 60
3.3.3 Tagset design and inflected languages 61
3.3.4 Why a positional tagset? 65
4 Tagging techniques 66
4.1 Supervised methods 67
4.1.1 N-gram taggers/Markov models 68
4.1.1.1 TnT (Brants 2000) 71
4.1.1.2 Tagging inflected languages with MMs 74
4.1.2 Transformation-based error-driven learning (TBL) 75
4.1.2.1 Tagging inflected languages with TBL 77
4.1.3 Maximum Entropy 77
4.1.3.1 Tagging inflected languages with the MaxEnt-tagger 80
4.1.4 Memory-based tagging (MBT) 80
4.1.4.1 Tagging inflected languages with MBT 81
4.1.5 Decision trees 82
4.1.5.1 Tagging inflected languages with decision trees 83
4.1.6 Neural networks 83
4.1.6.1 Tagging inflected languages with neural networks 85
4.2 Unsupervised methods 85
4.2.1 Markov models 85
4.2.1.1 Tagging inflectional languages with HMMs 87
4.2.2 Transformation-based learning (TBL) 87
4.3 Comparison of the tagging approaches 89
4.4 Classifier combination 89
4.4.1 Subsampling of training examples 90
4.4.2 Simple voting 92
4.4.2.1 Pairwise voting 92
4.4.3 Stacked classifiers 92
4.4.4 Combining POS-taggers 93
4.5 A special approach to tagging highly inflected languages 98
4.5.1 Exponential tagger 99
4.5.2 Other experiments 102
4.6 Summary 103
5 Previous resource-light approaches to NLP tasks 105
5.1 Unsupervised or minimally supervised approaches 106
5.1.1 Unsupervised POS tagging 106
5.1.2 Minimally supervised morphology learning 107
5.1.2.1 Yarowsky and Wicentowski (2000) 109
5.2 Cross-language knowledge induction 113
5.2.1 Cross-language knowledge transfer using parallel texts 113
5.2.2 Bilingual lexicon acquisition 114
5.2.2.1 POS tagging 115
5.2.2.2 Parsing 116
5.2.2.3 Semantic classes 118
5.2.3 Cross-language knowledge transfer without parallel corpora 119
5.2.3.1 Word sense disambiguation (WSD) and translation lexicons 119
5.2.3.2 Named Entity (NE) recognition 120
5.2.3.3 Verb classes 122
5.2.3.4 Inducing POS taggers with a bilingual lexicon 124
5.2.3.5 Parsing 127
5.3 Summary 128
II A New Resource-light Approach to Morphological Tagging of Inflected Languages 130
6 A new resource-light approach to morpho-syntactic tagging. The set up 132
6.1 Tag system 134
6.1.1 Slavic tagsets 135
6.1.1.1 Czech tagset 135
6.1.1.2 Russian tagset 135
6.1.1.3 Polish tagset 137
6.1.2 Romance tagsets 138
6.1.2.1 Spanish tagset 138
6.1.2.2 Catalan tagset 139
6.1.2.3 Portuguese tagset 139
6.2 Corpora 139
6.2.1 Slavic corpora 140
6.2.1.1 Czech corpora 141
6.2.1.2 Polish corpora 141
6.2.1.3 Russian corpora 141
6.2.2 Romance corpora 142
6.2.2.1 Spanish corpora 142
6.2.2.2 Portuguese corpora 142
6.2.2.3 Catalan corpora 142
6.3 Morphological analysis 143
6.3.1 Paradigms 144
6.3.1.1 Russian paradigms 146
6.3.1.2 Portuguese paradigms 147
6.3.1.3 Catalan paradigms 147
6.3.2 Closed-class words 147
6.3.3 Ending-based guesser 150
6.3.4 Lexicon-based analyzer 150
6.3.5 Abbreviation processor 151
6.3.6 Lexicon acquisition 151
6.3.7 The algorithm for lexicon acquisition 152
6.4 Quantifying language properties 153
6.4.1 Tagset size, tagset coverage 153
6.4.2 How much training data is necessary? 156
6.4.3 Data sparsity, context, and tagset size 161
6.4.4 Summary 163
7 Experiments in cross-language morphological annotation transfer 165
7.1 Why a Markov model 166
7.2 Performance expectations 167
7.2.1 The lower bounds 168
7.2.2 The upper bound 171
7.3 The basic approach 171
7.4 Upper bounds of transitions and emissions 175
7.4.1 Upper bound — word order 175
7.4.2 Upper bound — lexicon 179
7.5 Further approximation of transitions 181
7.5.1 “Russifications” 181
7.5.2 Slavic Interlingua 185
7.5.3 Combining two language models 189
7.6 Further approximation of emissions: Cognates 189
7.6.1 Cognate detection 191
7.6.2 Cognate transfer 192
7.7 Dealing with data sparsity 194
7.7.1 Tag decomposition 196
7.7.2 Combining sub-taggers 199
7.8 Evaluation and discussion 202
7.8.1 Comparing the performance of the models on different languages 202
7.8.2 Alternative ways of evaluation 205
7.9 Summary 206
8 Summary and further work 210
8.1 Summary of the thesis 210
8.2 Future work 212
8.2.1 Cognates 212
8.2.2 Other morpho-syntactic features 212
8.2.3 Other annotation schemes 214
8.2.4 Alternative evaluation methods 214
8.2.5 Other types of knowledge induction 214
8.2.6 Comparison with the standard approaches 215
8.2.7 Language closeness or size of the training data? 215
8.2.8 Other inflected languages 215
8.2.9 Cross-language morphology induction and active learning 216
8.2.10 Language transfer in language acquisition 217
Appendices 218
A Czech positional tagset 219
B Detailed specifications of the Russian positional tagset 229
C Detailed specifications of the positional tagset for Spanish, Catalan, and Portuguese 238
D Russian tagset 244
E Polish tag correspondences 245
F Spanish tag correspondences 246
G Portuguese tagset 247
H Catalan tag correspondences 248
I Sub-taggers: complementarity rate 249
J Tagging accuracy on all categories for Catalan, Portuguese, and Russian 250
Bibliography 258
Citation Index 274
LIST OF TABLES
2.1 Declension Ia – an example 17
2.2 I-conjugation – grabit’ ‘rob’ 19
2.3 Slavic: shallow contrastive analysis 21
2.4 Basic words: comparison of Russian, Czech, and Polish 23
2.5 Romance: Shallow contrastive analysis 39
2.6 Germanic influence on Spanish, Portuguese, and Catalan 40
2.7 Arabic influence on Spanish, Portuguese, and Catalan 41
2.8 Basic words: Comparison of Spanish, Portuguese, and Catalan 41
3.1 Language population, language technology 49
3.2 Positional Tag System for Czech 54
6.1 Overview and comparison of the Slavic tagsets 136
6.2 Size of Slavic tagsets 137
6.3 Overview and comparison of the Romance tagsets 140
6.4 Size of Romance tagsets 140
6.5 Masculine nouns ending on a “hard” (non-palatalized) consonant e.g student ‘student’, stol ‘table’, slon ‘elephant’ etc. 146
6.6 The -ar verbs in Portuguese 148
6.7 The -ar verbs in Catalan 149
6.8 The corpus and detailed tagset size, n-gram counts, entropy (H), mutual information (I), and average tag/token ambiguity: Slavic, Romance, English 154
6.9 The corpus and reduced tagset size, n-gram counts, entropy (H), mutual information (I), and average tag/token ambiguity: Slavic, Romance, English 155
7.1 Lower bound of performance on all categories 168
7.2 Lower bound of performance on nouns 169
7.3 Lower bound of performance on verbs 169
7.4 Lower bound of performance on adjectives 170
7.5 Homonymy of the -a ending in Russian 170
7.6 Comparison of recall and average ambiguity in morphological analysis 171
7.7 Evaluation of the basic model on all categories 172
7.8 Evaluation of the basic model on nouns 173
7.9 Evaluation of the basic model on verbs 173
7.10 Evaluation of the basic model on adjectives 174
7.11 Overview and comparison of the tagsets 176
7.12 Upper bounds of transitions for all categories compared to the basic model 177
7.13 Upper bounds of transitions for nouns 177
7.14 Upper bounds of transitions for verbs 178
7.15 Upper bounds of transitions for adjectives 178
7.16 Upper bounds of emissions for all categories 179
7.17 Upper bounds of emissions for nouns 180
7.18 Upper bounds of emissions for verbs 180
7.19 Upper bounds of emissions for adjectives 181
7.20 Czech transitions compared with ‘russified’ transitions. Evaluation on all categories 184
7.21 Czech transitions compared with ‘russified’ transitions. Evaluation on nouns 184
7.22 Czech transitions compared with ‘russified’ transitions. Evaluation on adjectives 185
7.23 Czech transitions compared with ‘russified’ transitions. Evaluation on verbs 185
7.24 Czech, Russified, Polish, Interlingua, and Hybrid transitions for all categories 187
7.25 Czech, Russified, Polish, Interlingua, and Hybrid transitions for nouns 187
7.26 Czech, Russified, Polish, Interlingua, and Hybrid transitions for adjectives 188
7.27 Czech, Russified, Polish, Interlingua, and Hybrid transitions for verbs 188
7.28 Evaluation of Russian tagging of all categories with various parameters 193
7.29 Evaluation of Russian tagging of nouns with various parameters 194
7.30 Evaluation of Russian tagging of adjectives with various parameters 194
7.31 Evaluation of Russian tagging of verbs with various parameters 195
7.32 Evaluation of Catalan and Portuguese tagging of all categories: even vs cognate-approximated emissions 195
7.33 Evaluation of Catalan and Portuguese tagging of nouns: even vs cognate-approximated emissions 196
7.34 Evaluation of Catalan and Portuguese tagging of adjectives: even vs cognate-approximated emissions 196
7.35 Evaluation of Catalan and Portuguese tagging of verbs: even vs cognate-approximated emissions 197
7.36 Russian tagger performance trained on individual slots vs tagger performance trained on the full tag 197
7.37 Russian tagger performance trained on the combination of two features vs tagger performance trained on the full tag 198
7.38 Russian tagger performance trained on the combination of three or four features vs tagger performance trained on the full tag 198
7.39 Russian tagging accuracy of the model with cognate-approximated emissions vs the voted classifier (best three subtaggers) for all categories 200
7.40 Russian tagging accuracy of the model with cognate-approximated emissions vs the voted classifier (best three subtaggers) for nouns 200
7.41 Russian tagging accuracy of the model with cognate-approximated emissions vs the voted classifier (best three subtaggers) for adjectives 201
7.42 Russian tagging accuracy of the model with cognate-approximated emissions vs the voted classifier (best three subtaggers) for verbs 201
7.43 A contingency table for testing the models 203
7.44 McNemar’s χ2 test results for Catalan, Portuguese, and Russian 204
7.45 Number of feature changes needed to recreate gold standard 205
D.1 Sample tags for Russian nouns 244
E.1 A fragment of the Polish tagset 245
F.1 A fragment of the Spanish tagset 246
G.1 A sample of Portuguese tags 247
H.1 A fragment of the Catalan tagset 248
I.1 Complementarity rate of subtaggers (Brill and Wu 1998) 249
LIST OF FIGURES
2.1 Slavic languages 11
2.2 Romance languages 24
6.1 The number of distinct tags plotted against the number of tokens for the detailed tagset 156
6.2 The number of distinct tags plotted against the number of tokens for the reduced tagset 157
6.3 The percentage of the tagset covered by the number of tokens for the detailed tagset 158
6.4 The percentage of the tagset covered by the number of tokens for the reduced tagset 159
6.5 The percentage of the corpus covered by the 5 most frequent tags for the detailed tagset 160
6.6 The percentage of the corpus covered by the 5 most frequent tags for the reduced tagset 161
6.7 Accession rate for the detailed tagset 162
6.8 Accession rate for the reduced tagset 163
7.1 An algorithm for combining subtaggers 199
7.2 Complementarity rate analysis (Brill and Wu 1998) 202
7.3 McNemar’s χ2 test 204
J.1 Tagging accuracy on all categories 250
J.2 Tagging accuracy on nouns 251
J.3 Tagging accuracy on adjectives 252
J.4 Tagging accuracy on verbs 253
J.5 Tagging accuracy on subPOS 254
J.6 Tagging accuracy on gender 255
J.7 Tagging accuracy on number 256
J.8 Tagging accuracy on case 257
depicted in a Russian film, The Cuckoo (Kukushka, 2002). At the end of the movie, as in any well-intentioned, man-made story, life wins and the barriers fall, giving mankind a sense of hope and reconciliation.
As shown in the movie, language barriers contribute a great deal to misunderstanding and miscommunication. Today’s technology is doing a tremendous job of overcoming language barriers. For instance, by using some online machine translation systems, Internet users can gain access to information from the original source language, and therefore, ideally, form unbiased opinions. The process of learning foreign languages is also facilitated by technology. It is no longer a luxury to have an intelligent computer language tutor that will detect and correct our spelling, grammar, and stylistic errors. These are just a few examples of what language technology is capable of doing.
It is unfortunate, however, that not all languages receive equal attention. Many languages lack even the most rudimentary technological resources.
1.1 Language technology and resource-poor languages
This thesis concerns the development of a method for morphological tagging of resource-poor languages. “Morphological tagging” is the process of assigning POS, case, number, gender, and other morphological information to each word in a corpus. “Resource-poor” languages are languages for which few digital resources exist, and thus languages whose computerization poses unique challenges. “Resource-poor” languages are also those languages with limited financial, political, and legal support: languages that lack the global importance of the world’s major languages.
In spite of these challenges, resource-poor languages and their speakers are not being ignored. Individuals, governments, and companies alike are busy developing technologies and tools to support such languages (e.g. ILASH 2002). They are driven by a variety of motivations. First, there is a sincere aspiration among academics and community activists to preserve or revitalize endangered or threatened languages; creating electronic resources for such languages is not the solution, of course, but an important contribution to the enterprise. Second, some governments strive to promote minority languages. Third, there is a need by other governments to detect hostile chatter in diverse tongues. Finally, some companies are trying to enhance their stature in emerging markets such as China and South America. Even though the system developed in the thesis has been tested on languages which are relatively resource-poor and are not endangered (Catalan, Portuguese, and Russian), the same method can be applied to any pair of related inflected languages, if one of them has an annotated corpus.
Success in natural language processing (NLP) depends crucially on good resources. Standard tagging techniques are accurate, but they rely heavily on high-quality annotated training data. The training data also has to be statistically representative of the data on which the system will be tested. In order to adapt a tagger to new kinds of data, it has to be trained on new data that is similar in style and genre. However, the creation of such data is time-consuming and labor-intensive. It took six years to create the Brown corpus (Kucera and Francis 1967), a one-million token corpus of American English annotated with 87 part of speech tags, for instance. If state-of-the-art performance requires this level of annotation effort and time spent for English, what of languages that typically receive less effort and attention, but suddenly become important? How can we ever hope to build annotated resources for more than a handful of the world’s languages?
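To make the scale of this effort concrete, the following back-of-the-envelope sketch turns the Brown corpus figures quoted above (about one million tokens, six years) into an implied throughput. The working calendar (250 days of 8 hours) and the 250 tokens-per-hour annotator rate are assumptions chosen purely for illustration; they are not taken from the dissertation or from the Brown corpus documentation.

```python
# Back-of-the-envelope estimate of manual annotation effort.
# From the text: the Brown corpus has about one million tokens and took
# six years to create. Every other number here is an assumption.

BROWN_TOKENS = 1_000_000   # from the text
BROWN_YEARS = 6            # from the text

# Hypothetical working calendar, assumed for illustration only.
WORK_DAYS_PER_YEAR = 250
WORK_HOURS_PER_DAY = 8


def implied_tokens_per_hour(tokens: int, years: float) -> float:
    """Throughput implied by corpus size and project duration,
    as if the whole project were a single full-time annotator."""
    total_hours = years * WORK_DAYS_PER_YEAR * WORK_HOURS_PER_DAY
    return tokens / total_hours


def annotator_hours(tokens: int, tokens_per_hour: float) -> float:
    """Hours one annotator would need at a given (assumed) rate."""
    return tokens / tokens_per_hour


if __name__ == "__main__":
    rate = implied_tokens_per_hour(BROWN_TOKENS, BROWN_YEARS)
    print(f"Implied throughput: {rate:.1f} tokens/hour")
    # 250 tokens/hour is an assumed rate, not a measured one.
    hours = annotator_hours(BROWN_TOKENS, 250)
    print(f"One annotator at 250 tokens/hour: {hours:.0f} hours")
```

Under these assumptions a single annotator would need thousands of hours for one corpus in one language, which is the point of the rhetorical question above.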
Resnik (2004) compares high quality translation with detailed linguistic annotation and puts them on the same order of magnitude of difficulty: turnaround times for professional translation services, based on an informal survey of several Web sites, suggest a productivity estimate of around 200–300 words per hour for experienced translators. If this is the rate of progress for this task, the prospect for manual annotation of linguistic representations across hundreds of languages seems bleak indeed. Even though it might seem like the annotation and translation tasks require different levels of language knowledge, a mere knowledge of the grammar is insufficient for doing manual morphological annotation. Languages with rich morphology are highly ambiguous: the same morphological form can correspond to multiple analyses, so understanding the context and the meaning of words is crucial for disambiguation.
1.2 Morphological tagging
The importance of the part of speech for language processing is that it gives a significant amount of information about a word and its neighbors. For example, corpora that have been POS-tagged are very useful in linguistic research for finding instances or frequencies of particular constructions in large corpora (e.g. Meurers 2005).
POS information can also provide a useful basis for syntactic parsing. Knowing the part of speech information about each word in an input sentence helps determine a correct syntactic structure in a given formalism.
Knowing which POS occurs next to which can be useful in a language model for speech recognition (i.e. for deciphering spoken words and phrases). In addition, a word’s POS can tell us something about how the word is pronounced. Thus, for example, in English the verb object [əbˈdʒɛkt] is pronounced differently from the noun object [ˈabdʒɛkt].
Knowing a word’s POS is useful in morphological generation (i.e. mapping a linguistic stem to all matching words), since knowing a word’s POS gives us information about which morphological affixes it can take. This knowledge is crucial for extracting verbs or other important words from documents, which later can be used for text summarization, for example.
Automatic POS taggers can help in building automatic word-sense disambiguation algorithms, since the meaning of individual words is related to their POS and the POS of adjacent words. For example, down_prep (as in look down), down_adj (as in down payment), and down_verb (as in They down wild boars) do not have the same meaning.
This dissertation mainly deals with inflectional languages. Inflectional information is crucial for various tagging applications. Inflections are not just another quirk of certain languages. Inflectional languages usually have free word order. In order to decide what syntactic relationships hold between the elements of a sentence, and what constituents agree with what constituents, detailed morphological information is essential. Morphological tags carry important information which is essential for parsing or text-to-speech applications, for instance. We want not only to tell apart verbs from nouns, but also singular from plural, nominative from genitive, all of which are ambiguous one way or the other. For example, in order to determine which syllable of a given instance of the Russian word snega should be stressed, one must know the morphological properties of that instance. The genitive singular form of the word is stressed on the first syllable, while the nominative plural form is stressed on the second: snèga.Noun.Gen.Masc.Singular ‘snow’ vs. snegà.Noun.Nom-Acc.Plural ‘snows’.
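The snega example can be sketched as a lookup keyed on both the surface form and its morphological tag, which is roughly what a text-to-speech front end would need from a tagger. The two entries come from the example above; the dictionary layout and the function names are hypothetical and serve only as illustration.

```python
# Toy stress lexicon: (surface form, morphological tag) -> stressed form.
# The two entries are the example from the text; the data structure and
# the helper functions are hypothetical.
STRESS_LEXICON = {
    ("snega", "Noun.Gen.Masc.Singular"): "snèga",
    ("snega", "Noun.Nom-Acc.Plural"): "snegà",
}


def candidate_tags(word):
    """All tags the lexicon lists for an unstressed surface form."""
    return sorted(tag for (form, tag) in STRESS_LEXICON if form == word)


def stressed_form(word, tag):
    """Stressed form for a word once its tag is known, else None."""
    return STRESS_LEXICON.get((word, tag))


if __name__ == "__main__":
    # Without a tag, the surface form alone is ambiguous.
    print(candidate_tags("snega"))
    # With the tag a tagger would supply, the stress is determined.
    print(stressed_form("snega", "Noun.Nom-Acc.Plural"))
```

The surface form alone maps to two candidates; only the tag chosen in context selects the stress pattern.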
1.3 Reducing the annotation burden by cross-language knowledge induction
The focus of this dissertation is on the portability of technology to new languages and on rapid language technology development. This dissertation addresses the development of taggers for languages with extremely scarce resources. With respect to tagging, languages with “scarce resources” are those that lack a large annotated corpus in electronic form and/or large lexicons. This dissertation takes a novel approach to rapid, low-cost development of taggers by exploring the possibility of taking existing resources for one language and applying them to another.
Languages that are related either by common heritage (e.g. Czech and Russian) or by borrowing (or “contact”, e.g. Bulgarian and Greek) often share a number of properties: morphological systems, word order, and vocabulary. A related language, therefore, provides a point of departure for the porting of information from one language to another.
This dissertation describes a knowledge- and resource-light system for automatic morphological analysis and tagging of inflected languages with scarce resources. The method avoids the use of labor-intensive resources; instead, it relies on the following:
1. an annotated corpus of a source language
2. an unannotated corpus of a related target language
3. a description of the target language morphology (either taken from a basic grammar book or elicited from a native speaker)
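One way these three resources could fit together is sketched below. It assumes a bigram Markov model: tag transitions are counted in the annotated source corpus, and a target-language morphological analyzer proposes candidate tags for each target word, with uniform ("even") emissions. The table of contents mentions a Markov model, transitions, and even emissions, but the details here (add-one smoothing, the shared three-tag tagset, the toy Czech and transliterated Russian data, the artificially ambiguous analyses, and all function names) are assumptions for illustration, not the dissertation's implementation.

```python
from collections import Counter
from math import log

START = "<s>"
TAGSET = ["ADJ", "NOUN", "VERB"]  # assumed shared by source and target

# Resource 1: toy annotated source-language corpus (Czech).
SOURCE_CORPUS = [
    [("nový", "ADJ"), ("student", "NOUN"), ("čte", "VERB")],
    [("starý", "ADJ"), ("slon", "NOUN"), ("spí", "VERB")],
]

# Resource 3: toy target-language analyzer output (transliterated
# Russian). The ambiguity is artificial, to give the model a choice.
TARGET_ANALYSES = {
    "staryj": ["ADJ"],
    "slon": ["NOUN", "VERB"],
    "spit": ["NOUN", "VERB"],
}


def analyze(word):
    """Candidate tags for a target word; empty list if unknown."""
    return TARGET_ANALYSES.get(word, [])


def train_transitions(tagged_sentences):
    """Count tag bigrams in the annotated source-language corpus."""
    bigrams, unigrams = Counter(), Counter()
    for sentence in tagged_sentences:
        prev = START
        for _word, tag in sentence:
            bigrams[(prev, tag)] += 1
            unigrams[prev] += 1
            prev = tag
    return bigrams, unigrams


def transition_logprob(bigrams, unigrams, prev, tag, n_tags):
    """Add-one smoothed log P(tag | prev)."""
    return log((bigrams[(prev, tag)] + 1) / (unigrams[prev] + n_tags))


def tag_sentence(words, analyze, bigrams, unigrams, tagset):
    """Viterbi search over the tags the analyzer allows for each word.

    Emissions are uniform over a word's candidate tags, so the
    source-language transitions decide between candidates.
    """
    n_tags = len(tagset)
    best = {START: (0.0, [])}  # tag -> (log prob, best path to tag)
    for word in words:
        candidates = analyze(word) or tagset  # unknown word: any tag
        emission = -log(len(candidates))
        new_best = {}
        for tag in candidates:
            score, path = max(
                (prev_score
                 + transition_logprob(bigrams, unigrams, prev, tag, n_tags)
                 + emission, prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[tag] = (score, path + [tag])
        best = new_best
    return max(best.values())[1]


if __name__ == "__main__":
    bigrams, unigrams = train_transitions(SOURCE_CORPUS)
    print(tag_sentence(["staryj", "slon", "spit"], analyze,
                       bigrams, unigrams, TAGSET))
```

In this toy setting the source corpus only ever shows ADJ followed by NOUN followed by VERB, so the transitions resolve both ambiguous target words without any annotated target data. Resource 2, the unannotated target corpus, plays no role in this sketch.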
The approach described in this thesis takes the middle road between free approaches and those that require extensive manually created resources For the ma-jority of languages and applications, neither of these extreme approaches is warranted The
Trang 29knowledge-knowledge-free approach lacks precision and the knowledge-intensive approach is usuallytoo costly.
This dissertation explores the possibility of transferring relevant linguistic information from a source (resource-rich) language into a target (resource-poor) language. The experiments discussed include both Slavic and Romance languages. This is the first systematic study to investigate the possibility of adapting the knowledge and resources of one morphologically rich language to process another related inflectional language without the use of parallel corpora or bilingual lexicons. The main scientific contribution is to provide a better understanding of the generality or language-specificity of cross-language annotation methods. The practical contribution consists of developing and implementing a portable system for tagging resource-poor languages. Finding effective ways to adapt a tagger which was trained on another language with similar linguistic properties has the potential to become the standard way of tagging languages for which large, labeled corpora are not available.
1.4 Related work
The idea of “information transfer” is not new, especially in areas such as the study of Second Language Acquisition (SLA). As the name suggests, SLA research focuses on how humans acquire a new language (commonly referred to as an L2). One important issue is that prior knowledge of the first language (L1) can affect the acquisition of the L2 (cf. Odlin 1990). This thesis does not address the issue of how the established L1 lexicon (in particular, morpho-syntactic knowledge) interacts with the emerging L2 lexicon of a language learner, nor does it address the nature of the mental organization of such knowledge. What is of interest here is how the idea of L1 and L2 morpho-syntactic and lexical interaction can carry over to a machine-learning setting. In particular, the goal is to use annotated data in one language to aid the automatic learning of morpho-syntactic information in another language.
1.5 Dissertation structure
The structure of the dissertation is as follows. Part I (Chapters 2-5) lays out the linguistic and computational foundations of the current work. Chapter 2 describes the six languages used in the experiments: Czech, Polish, Russian, Catalan, Portuguese, and Spanish. The description mainly concentrates on the properties relevant for the task of morphological tagging. Chapter 3 summarizes the tagsets and the corpora that exist for these six languages. This is not intended as an exhaustive list of all the language resources, but it provides an overview and discussion of the major tagsets, tag systems, and annotated corpora developed for these languages. It also comments on the standardization of tagsets across languages. Chapter 4 provides a survey of tagging techniques as well as classifier combination methods. A number of supervised and unsupervised methods are described and compared, and the final sections of the chapter are devoted to the question of the appropriateness of these methods for inflected languages in general, and for Romance and Slavic languages in particular. Chapter 5 summarizes previous resource-light approaches to Natural Language Processing (NLP) tasks. This chapter takes a closer look at two bootstrapping solutions, both because they are fairly well-researched and because they seem promising for the problem of creating language technology for resource-poor languages. At the same time, there are some theoretically interesting questions as to their general applicability, which the chapter addresses as well. One solution to be discussed is unsupervised or minimally supervised learning of linguistic generalizations from corpora; the other one is cross-language knowledge induction.
Part II (Chapters 6-8) introduces the new resource-light approach to morphological tagging of inflected languages. Chapter 6 discusses the tag system that has been developed in this dissertation for processing Russian, Portuguese, and Catalan. The chapter also introduces the portable resource-light approach to morphological analysis (Hana 2005) used in this dissertation. It is shown that using a detailed tagset for richly inflected languages is beneficial, and that a reduction in the tagset size does not necessarily lead to better results. It is also shown that even though the inflected languages are considered to be relatively free in their word order, n-gram techniques succeed in reducing the average tag/word ambiguity. Chapter 7 discusses a range of experiments in cross-language morphological annotation transfer. The experiments are divided into three types: those that deal with approximating the word order of the target language using the source language; those that deal with approximating the lexical information of the target language from the lexicon of the source language; and those that deal with data sparsity problems. The first two experiments are reported for all language pairs (Russian-Czech, Portuguese-Spanish, and Catalan-Spanish); the third experiment is reported only for Russian. Chapter 8 summarizes the work and describes the future direction of research arising from this thesis.
Part I Linguistic and Computational Foundations
CHAPTER 2
LANGUAGE PROPERTIES
This chapter describes the six languages used in the experiments. Three (Czech, Polish, and Russian) belong to the Slavic family, and three (Catalan, Portuguese, and Spanish) belong to the Romance group of languages. Since the goal of the task is to project morpho-syntactic information from a source language to a target language, the discussion concentrates mainly on characterizing the morpho-syntactic properties of these languages.
2.1 Slavic languages
Slavic (Slavonic) languages are a group of Indo-European languages spoken in most of Eastern Europe, much of the Balkans, part of Central Europe, and the northern part of Asia. As Figure 2.1 illustrates, Slavic languages are divided into three branches: South, West, and East Slavic. The South Slavic branch is split further into Western and Eastern subgroups. The Western subgroup is composed of Slovenian, Serbian, Bosnian, and Croatian. The languages from the Western subgroup are spoken in Slovenia, Bosnia and Herzegovina, Croatia, Serbia and Montenegro, and the adjacent regions. The Eastern subgroup consists of Bulgarian in Bulgaria and adjacent areas, and Macedonian in the Republic of Macedonia, Bulgaria, Greece and Albania. West Slavic includes Czech in the Czech Republic and Slovak in Slovakia, Upper and Lower Sorbian in Germany, and Lekhitic (Polish and its related dialects, Kashubian, Polabian, Obodrits). Russian, Ukrainian and Belarusian belong to the East Slavic branch.
Figure 2.1: Slavic languages
A description of three Slavic languages (Czech, Polish, and Russian) is provided below. These are the three Slavic languages used in the experiments reported in Chapters 6 and 7. The description of these languages is based on Comrie and Corbett (2002). It is worth reemphasizing that this is not an exhaustive, contrastive study of the languages. The focus is on the properties relevant to the cross-lingual induction of morpho-syntactic information. Facts about the derivational morphology of these languages are completely omitted. These languages are compared in terms of their declensional and conjugational systems, word order patterns, subject-verb and adjective-noun agreement, and (non-)existence of special clitics.
2.1.1 Czech
General. The Czech language is one of the West Slavic languages. It is spoken by most people in the Czech Republic and by Czechs all over the world, about 12 million native speakers in total (http://www.ethnologue.com).
Morphology. Czech is a richly inflected language, like other Slavic languages. Czech nouns and adjectives distinguish gender, number, and case, and in some cases, animacy. There are seven cases: nominative, accusative, genitive, dative, instrumental, locative, and vocative. About half the singular noun paradigms have a distinctive vocative form shared by no other case; no adjectival, pronominal, numeral or plural noun paradigms have distinct vocative forms (i.e. vocative = nominative). There are three genders, the subcategory of animacy functioning within the masculine only. In the singular, animate accusative equals genitive, which itself, in the core (hard) masculine paradigm, differs from the inanimate genitive. Similarly, animate dative and locative usually differ from their inanimate equivalents. In the plural, the animate and inanimate differ only in the nominative.
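The syncretism described above is one source of the ambiguity a tagger has to resolve. As a minimal sketch, the following Python fragment indexes three singular case forms of two hard-stem masculine nouns, pán ‘lord’ (animate) and hrad ‘castle’ (inanimate). The choice of lemmas, the restriction to three cases, the dictionary layout, and the function name are illustrative assumptions and are not part of the system described in this dissertation.

```python
from collections import defaultdict

# Three singular case forms of two hard-stem masculine nouns.
# The lemmas and the data layout are illustrative assumptions.
PARADIGMS = {
    ("pán", "animate"): {"Nom": "pán", "Gen": "pána", "Acc": "pána"},
    ("hrad", "inanimate"): {"Nom": "hrad", "Gen": "hradu", "Acc": "hrad"},
}


def cases_by_form(paradigms):
    """Map each surface form to the sorted list of cases it can realize."""
    index = defaultdict(set)
    for forms in paradigms.values():
        for case, form in forms.items():
            index[form].add(case)
    return {form: sorted(cases) for form, cases in index.items()}


ambiguity = cases_by_form(PARADIGMS)
print(ambiguity["pána"])   # animate: one form serves accusative and genitive
print(ambiguity["hrad"])   # inanimate: one form serves accusative and nominative
print(ambiguity["hradu"])  # inanimate genitive, distinct from animate -a
```

Within this three-case fragment, every form that maps to more than one case is a form for which a tagger must choose between competing tags.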
As in other Slavic languages, the morphology of numerals is complex in Czech. For example, among the cardinal numbers, only ‘1’, ‘2’, ‘3’, and ‘4’ function adjectivally and retain the morphology of case. The inflection of the other cardinal numerals is limited to the oblique-case ending -i. Multidigit ordinal numbers have all digits in the ordinal form (e.g. dvacátý pátý, ‘25th’) and are fully declined. Two-digit numerals between whole tens may have an inverted one-word form (e.g. pětadvacátý, ‘25th’).
For verbs, person is expressed through inflection. Three tenses are recognized, a superficially simple system refined by the Slavonic aspects. Present time meanings are expressed by the basic conjugated forms. The imperative is expressed morphologically in second and first person plural, and analytically in others. The conditional is expressed by a combination of a verb with the conjugated enclitic auxiliary by.
Five main conjugational types of verbs are recognized. They are distinguished on the basis of the third person singular form, marked by the following endings: (I) -e, (II) -n-e, (III) -j-e, (IV) -í, and (V) -a. Class V is a historic innovation, born of the contraction of once disyllabic endings and assimilation to the verb dát.
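The five-way classification above can be approximated by inspecting the ending of the third person singular form. The sketch below is a rough suffix heuristic under stated assumptions, not the analyzer used in this dissertation: the function name, the rule ordering, and the example verbs (nese, tiskne, kupuje, prosí, dělá) are assumptions introduced for illustration. The class V ending is matched both as printed above (-a) and with the long vowel found in forms such as dělá.

```python
# Ordered from most to least specific: -ne and -je must be tested before -e.
# The rule list is an illustrative assumption, not an exhaustive grammar.
ENDINGS = [
    ("ne", "II"),
    ("je", "III"),
    ("e", "I"),
    ("í", "IV"),
    ("á", "V"),  # long vowel, as in dělá
    ("a", "V"),  # ending as printed in the text above
]


def conjugation_class(third_sg):
    """Guess the conjugational class from a third person singular form."""
    for ending, cls in ENDINGS:
        if third_sg.endswith(ending):
            return cls
    return None  # no listed ending matched


for form in ["nese", "tiskne", "kupuje", "prosí", "dělá"]:
    print(form, conjugation_class(form))
```

Because the rules look only at the final letters, a class I verb whose stem itself ends in n or j would be assigned to class II or III; a real analyzer would need lexical information to separate such cases.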
It is perhaps important to mention that there is a significant difference in morphology and the lexicon between the standard and colloquial levels of Czech. The automatic morphological analysis of such a language is especially challenging since the same word can have several morphological forms, depending on the language level. In addition, this feature of Czech suggests that a larger tagset is needed to describe all the properties of the language in full detail.
Syntax. In syntax, verbs agree with their subjects in number, person (in present forms), gender (in past forms) and animacy (in masculine forms). Adjectives agree with nouns they modify in case, number, gender, and animacy. As with other Slavic languages, Czech is a so-called free-word-order language, where the order of syntactic constituents is determined by pragmatic constraints. However, the position of adjectives is relatively rigid before the nouns they qualify, as is the position of dependent infinitives following the verbs on which they depend. Another issue in word order is the placing of enclitics, elements lacking word stress, which generally follow the first stressed constituent in the clause. Czech enclitics include the past and conditional auxiliaries, the “weak” forms of the personal pronouns, the conjunction li, and a small number of particles. The main copular verb is být and its frequentative bývat; it can never be omitted.
Reflexivity is expressed primarily by the free morpheme se. It is often described as a particle rather than a pronoun on the grounds of the many functions in which it is referentially empty, and because under emphasis or where agreement might be required, it behaves differently from other pronoun objects.
Sentence negation in Czech is formed by the prefix ne- attached to the verb. As in other Slavic languages, negative elements accumulate; any negative subject or object pronoun or pronoun-adverb is reinforced by ne- in the verb. Unlike the other languages described in this section, Czech does not have the “genitive of negation” phenomenon, i.e. the situation in which an accusative object becomes genitive if the verb selecting it is negated. The direct object after a negative is in the accusative (except for some archaic cases).
2.1.2 Polish
General. The Polish language belongs to the West branch of Slavic languages. Polish is mainly spoken in Poland. In fact, Poland is the most homogeneous European country in terms of its mother tongue, since close to 98% of Polish citizens declare Polish as their native language. According to http://www.ethnologue.com, there are 37 million speakers of Polish in Poland. The population of Polish speakers in all countries is 44 million (1999 WA). In the USA, the number of Polish speakers is over 1 million.
Morphology. The modern Polish number system distinguishes singular and plural. A few dialects preserve dual forms with dual meanings; much more common are remnants of dual endings with plural meaning. Polish has seven cases: nominative, genitive, accusative, dative, locative, vocative, and instrumental. The nominal gender system distinguishes as its primary categories masculine, feminine, and neuter, with masculine nouns further divided into two semantically based categories, animate/inanimate and personal/non-personal. The basic, three-way gender distinction is manifested primarily through syntactic means (agreement and anaphora), although particular declensional paradigms are associated with each gender.
Most verbs distinguish all three persons in singular and plural in the present, past and future tense, as well as in the conditional mood. Gender is distinguished in the past, in the conditional, and in one variant of the imperfective future. Some verbs used without any subject have only their third person singular forms, and in general, the third person singular (neuter) is the default verb form if no nominative grammatical subject is present or understood.
Polish is more like Russian than like Czech in its use of perfective and imperfective verbs, although it employs the perfective with greater freedom (in the context of repetition and in non-future meanings of non-past forms). Polish also lies between Russian and Czech in its use of frequentatives (e.g. jadać ‘eat (often)’, czytywać ‘read (often)’). Like Russian, Polish has few such verbs, but unlike Russian frequentatives, which are used only in the past tense, the Polish verbs have full paradigms. They are also used more often than their Russian counterparts, but are not regular formations as in Czech. There are four conjugational types in Polish, distinguishing first, second, and third persons, singular and plural; thus, independent personal pronouns are used only for emphasis. There are, however, honorific second person pronouns which are used with third person forms of the verb.
Syntax. In syntax, main verbs agree in person and number with their subjects. Adjectives agree in number, gender, and case with the noun they modify.
As in other Slavic languages, word order is not constrained by grammar: there is no particular order for constituents realizing the subject, object, possessor, etc. However, the unmarked order is Subject-Verb-Object. The inflectional system is responsible for distinguishing grammatical relations and roles. Pragmatic information and considerations of topic (what the sentence is about) and focus (new information conveyed by the sentence) are important in determining word order. Topics precede constituents that are in focus.
The main copulas are the verb być and the particle to. The verbal copula is used primarily to describe, while to is used primarily to identify and define. To may be combined with a form of być (normally third person, singular or plural) in the present tense and must be so combined in the past or the future.
The negative particle nie is used for sentence negation and for constituent negation, as well as in word formation. As in all the Slavic languages described here, multiple negative elements can occur together with sentence negation. The direct object of a negated verb is normally genitive (the so-called ‘genitive of negation’), even if the negation is not directly on the transitive verb but rather on an auxiliary or other verb governing a transitive infinitive.
2.1.3 Russian
General. Russian is an East Slavic language. Russian is primarily spoken in Russia and, to a lesser extent, the other countries that were once constituent republics of the USSR, as well as in Israel, North America and Western Europe. According to http://www.ethnologue.com, there are 167 million first-language Russian speakers in the world.
Morphology. Like Czech and Polish, Russian is a fusional language in which several inflections are often fused into one phonetic and orthographic form. For example, in the verb dela-et, the suffix -et indicates the person (3rd), the number (Sg), and the tense (present).
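The dela-et example can be restated as a lookup from a single suffix to a bundle of feature values, which is what distinguishes fusional from agglutinative marking. The fragment below is a minimal sketch: the table contains only the one suffix discussed in the text, and the function name, the feature labels, and the hyphen-segmented input format are assumptions made for illustration.

```python
# One suffix carries several categories at once (fusion).
# Only the suffix discussed in the text is listed; the layout is an assumption.
SUFFIX_FEATURES = {
    "et": {"person": "3", "number": "Sg", "tense": "present"},
}


def analyze(segmented):
    """Split a 'stem-suffix' string and look up the suffix's feature bundle."""
    stem, _, suffix = segmented.rpartition("-")
    features = SUFFIX_FEATURES.get(suffix)
    if not stem or features is None:
        return None  # unsegmented input or unknown suffix
    return {"stem": stem, **features}


print(analyze("dela-et"))
```

A single dictionary entry returns three feature values, whereas an agglutinative system would be modeled with one entry per category.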
In Russian, nominal parts of speech express distinctions of case, number and gender with different degrees of consistency and not always by the same morphological means. Number is expressed in all nominal parts of speech except numerals themselves. Russian has six primary cases (nominative, genitive, accusative, dative, instrumental, and locative) and two secondary cases (second genitive and second locative).
Nouns in Russian can be grouped into equivalence classes according to various criteria. One such grouping is declension class (see, for example, Table 2.1); another is (syntactic) gender, expressed through agreement in other parts of speech: attributive adjectives, predicative adjectives, the past tense of verbs, and pronouns. Declension type and gender are largely isomorphic: the members of a given declension or subdeclension condition the same agreement, and belong to the same gender. The exceptions mostly involve animate nouns.

            Hard stem       Soft stem
Singular
NOM         čin ‘rank’      kon’ ‘horse’
Plural

Table 2.1: Declension Ia – an example