Inducing POS taggers with a bilingual lexicon

Part of the document Portable Language Technology: A Resource-Light Approach to Morpho-Syntactic Tagging (pages 147-150)

5.2 Cross-language knowledge induction

5.2.3 Cross-language knowledge transfer without parallel corpora

5.2.3.4 Inducing POS taggers with a bilingual lexicon

Cucerzan and Yarowsky (2002) present a method of bootstrapping a fine-grained, broad-coverage part-of-speech tagger in a new language using only one person-day of data acquisition effort. The approach requires three resources:

1. An online or hard-copy pocket-sized bilingual dictionary.

2. A basic reference grammar.

3. Access to an existing monolingual text corpus in the language.

The steps of the algorithm are as follows:

1. Induce initial lexical POS distributions from English translations in a bilingual dictionary without POS tags.

2. Induce morphological analyses.

The authors note that when the translation candidate is a single word, inducing a preliminary POS distribution for a foreign word via a simple translation list is not problematic. For example, suppose the Romanian word mandat can be translated as the English warrant, proxy and mandate. Each of these English words can itself belong to more than one POS. Now suppose that P(N|warrant) = 66% and P(V|warrant) = 34%; P(N|proxy) = 55% and P(A|proxy) = 45%; P(N|mandate) = 80% and P(V|mandate) = 20%. Then P(N|mandat) = (66% + 55% + 80%)/3 = 67%, which means that in the majority of cases the Romanian word mandat is a noun.
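The averaging in this example can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `induce_pos_distribution` is hypothetical, the probabilities are those of the example above (with 66% for P(N|warrant), the value used in the averaged sum), and each translation is assumed to carry equal weight.

```python
from collections import defaultdict

# P(tag | English word), taken from the worked example in the text.
ENGLISH_POS = {
    "warrant": {"N": 0.66, "V": 0.34},
    "proxy": {"N": 0.55, "A": 0.45},
    "mandate": {"N": 0.80, "V": 0.20},
}


def induce_pos_distribution(translations, english_pos=ENGLISH_POS):
    """Average the POS distributions of single-word English translations.

    Assumption: every translation contributes with equal weight.
    """
    totals = defaultdict(float)
    for word in translations:
        for tag, prob in english_pos[word].items():
            totals[tag] += prob
    return {tag: total / len(translations) for tag, total in totals.items()}


mandat = induce_pos_distribution(["warrant", "proxy", "mandate"])
print({tag: round(prob, 2) for tag, prob in mandat.items()})
```

The noun reading receives 0.67, as in the text; the remaining mass is split between the verb and adjective readings.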

However, if a translation candidate is phrasal (e.g. the Romanian word mandat is translated as money order), then modeling the more general probability of the foreign word's POS is more challenging, since each English word in the phrase may itself have multiple POS's:

$$P(T_f \mid w_{e_1}, \ldots, w_{e_n}) = P(T_f \mid T_{e_1}, \ldots, T_{e_n}) \cdot P(T_{e_1}, \ldots, T_{e_n} \mid w_{e_1}, \ldots, w_{e_n})$$
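One way to read this factorization is sketched below for the phrase money order. Everything numeric here is hypothetical, and two assumptions go beyond the text: the English words are tagged independently of each other, and the product is summed over all English tag sequences so that the result is a proper distribution.

```python
from itertools import product

# Hypothetical P(tag | English word) for the words of "money order".
ENGLISH_POS = {
    "money": {"N": 1.0},
    "order": {"N": 0.7, "V": 0.3},
}

# Hypothetical P(T_f | T_e1, T_e2): foreign tag given the English tag sequence.
FOREIGN_GIVEN_ENGLISH = {
    ("N", "N"): {"N": 0.9, "A": 0.1},
    ("N", "V"): {"N": 0.4, "V": 0.6},
}


def phrasal_pos_distribution(phrase):
    """Combine both factors of the formula for a phrasal translation."""
    words = phrase.split()
    result = {}
    # Enumerate every possible English tag sequence for the phrase.
    for tags in product(*(ENGLISH_POS[word] for word in words)):
        # Assumption: word tags are independent, so the probability of the
        # sequence is the product of the per-word tag probabilities.
        p_sequence = 1.0
        for word, tag in zip(words, tags):
            p_sequence *= ENGLISH_POS[word][tag]
        # Assumption: marginalize over English tag sequences.
        for foreign_tag, p_foreign in FOREIGN_GIVEN_ENGLISH[tags].items():
            result[foreign_tag] = result.get(foreign_tag, 0.0) + p_foreign * p_sequence
    return result


distribution = phrasal_pos_distribution("money order")
print({tag: round(prob, 2) for tag, prob in distribution.items()})
```

With these toy numbers the noun reading dominates (0.75), which matches the intuition that a noun-noun English phrase usually translates a foreign noun.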

The authors mention several options for estimating $P(T_f \mid T_{e_1}, \ldots, T_{e_n})$. One is to assume that the POS usage of phrasal (English) translations is generally consistent across dictionaries (e.g. $P(N_f \mid N_{e_1}, N_{e_2})$ remains high regardless of publisher or language). Thus, any foreign-English bilingual dictionary that also includes the true foreign-word POS could be used to train these probabilities. Another option is to do a first-pass assignment of foreign-word POS's based only on single-word translations and use this to train $P(T_f \mid T_{e_1}, \ldots, T_{e_n})$ for those foreign words that have both phrasal and single-word translations. Cucerzan and Yarowsky (2002) suggest a third way: obtaining the probability of the foreign-word POS's via a third-language dictionary (e.g. Romanian via Spanish). Unfortunately, the authors are not explicit about the method they apply for inducing these probabilities, but a table given in the article states that the English translations were untagged and that the training dictionary (in the case of Romanian) was Spanish-English. Presumably, the probabilities of Romanian POS's are derived through the following series of steps: Romanian word → English translations → Spanish translations with POS's → Spanish POS's assigned to Romanian words via the English translations. If this is indeed the case, then the idea of Cucerzan and Yarowsky (2002) is very similar to the one explored in subsequent chapters of this thesis: transferring POS information from a related language to the target language.
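The first option, training the transfer probabilities on a dictionary that lists the true foreign-word POS, can be sketched as a relative-frequency estimate. This is an illustration only: the training entries are invented, the function name `train_tag_transfer` is hypothetical, and the sketch assumes the English tag sequence of each translation is already known, which the article (working with untagged English translations) does not assume.

```python
from collections import Counter, defaultdict

# Hypothetical training entries from a POS-annotated bilingual dictionary:
# (foreign-word POS, tag sequence of its phrasal English translation).
TRAINING_ENTRIES = [
    ("N", ("N", "N")),
    ("N", ("N", "N")),
    ("A", ("N", "N")),
    ("V", ("V", "N")),
    ("N", ("A", "N")),
    ("N", ("A", "N")),
]


def train_tag_transfer(entries):
    """Estimate P(foreign tag | English tag sequence) by relative frequency."""
    counts = defaultdict(Counter)
    for foreign_tag, english_tags in entries:
        counts[english_tags][foreign_tag] += 1
    model = {}
    for english_tags, tag_counts in counts.items():
        total = sum(tag_counts.values())
        model[english_tags] = {tag: n / total for tag, n in tag_counts.items()}
    return model


model = train_tag_transfer(TRAINING_ENTRIES)
print(model[("N", "N")])
```

Because the estimate depends only on tag sequences and not on the words themselves, a model trained on one language pair could in principle be reused for another, which is the consistency assumption described above.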

The next step in the work of Cucerzan and Yarowsky (2002) is to induce POS's using morphological analysis. They explore the idea that, for inducing morphological analyses, it is enough to begin with whatever knowledge can be efficiently entered by hand from a grammar book in several hours. The experiments described later in this thesis also explore this idea, specifically by using paradigm-based morphology for Russian, Portuguese, and Catalan that includes only the basic paradigms from a standard grammar textbook. Cucerzan and Yarowsky (2002) create a dictionary of regular inflectional affix changes and their associated POS, and on the basis of it they generate hypothesized inflected forms following the regular paradigms. Clearly, these hypothesized forms are often inaccurate, and the procedure overgenerates.
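The generation step can be pictured as follows. The paradigm table below is a toy with made-up, diacritic-free endings, not the affix dictionary used in the article, and the function name `hypothesize_forms` is hypothetical; the point is only that every lemma is expanded under every paradigm it formally fits, which is why overgeneration is unavoidable.

```python
# Toy paradigm table (an assumption for illustration): for each POS, a list of
# (ending of the citation form, endings of the inflected forms).
PARADIGMS = {
    "N": [("a", ["a", "e", "ei", "ele"])],
    "V": [("a", ["a", "ez", "ezi", "eaza"])],
}


def hypothesize_forms(lemma, pos):
    """Generate inflected forms for a lemma under all matching paradigms of a POS."""
    forms = set()
    for base_ending, endings in PARADIGMS.get(pos, []):
        if lemma.endswith(base_ending):
            # Strip the citation-form ending and attach each inflected ending.
            stem = lemma[: len(lemma) - len(base_ending)]
            forms.update(stem + ending for ending in endings)
    return forms


print(sorted(hypothesize_forms("casa", "N")))
print(sorted(hypothesize_forms("casa", "V")))
```

Here the same lemma is expanded both as a noun and as a verb, so at least one of the two sets consists mostly of forms that do not exist.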

Therefore, the authors perform a probabilistic match between all lexical tokens actually observed in a monolingual corpus and the hypothesized forms. In their next step, Cucerzan and Yarowsky (2002) combine the two models: the one created on the basis of dictionary information and the one produced by the morphological analysis. This approach relies heavily on two assumptions: 1) words of the same POS tend to have similar tag sequence behavior, and 2) there are sufficient instances of each POS tag labeled by either the morphology models or closed-class entries. For richly inflected languages such as Russian or Czech, however, data sparsity is a classic problem because of the large tagset (see the discussion in Chapter 3), so there is no guarantee that assumption (2) will always hold.
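The article's matching procedure is not spelled out here, so the following is only a stand-in that conveys the idea: each POS hypothesis for a lemma is scored by the share of its hypothesized forms that are attested in the corpus, and the scores are normalized. The function name `score_pos_candidates`, the scoring rule, and the toy data are all assumptions.

```python
from collections import Counter


def score_pos_candidates(hypotheses, corpus_tokens):
    """Score each POS by the fraction of its hypothesized forms seen in the corpus.

    hypotheses: dict mapping a POS to the set of forms generated for it.
    Returns scores normalized to sum to 1 (all zeros if nothing is attested).
    """
    counts = Counter(corpus_tokens)
    raw = {}
    for pos, forms in hypotheses.items():
        attested = sum(1 for form in forms if counts[form] > 0)
        raw[pos] = attested / len(forms) if forms else 0.0
    total = sum(raw.values())
    if total == 0:
        return {pos: 0.0 for pos in raw}
    return {pos: score / total for pos, score in raw.items()}


# Toy hypotheses and toy corpus, invented for illustration.
hypotheses = {
    "N": {"casa", "case", "casei", "casele"},
    "V": {"casa", "casez", "casezi", "caseaza"},
}
corpus = "casa este mare iar casele sunt mici langa case".split()
scores = score_pos_candidates(hypotheses, corpus)
print(scores)
```

Three of the four hypothesized noun forms occur in the toy corpus but only one verb form does, so the noun analysis wins; overgenerated forms simply never find a match.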

The last step in the POS-tagging approach of Cucerzan and Yarowsky (2002) is inducing the agreement features, specifically the gender information. Unlike English, languages such as Romanian or Spanish have Adj-Noun, Det-Noun, and Noun-Verb agreement at the subtag level (e.g. for person, number, case and gender). This information is missing in the induced tags, since they are projected from English. The authors assume that words exhibiting a property such as grammatical gender tend to co-occur in a relatively narrow window (±3) with other words of the same gender. Since the majority of nouns have a single grammatical gender independent of context, smoothing is performed to force nouns that are sufficiently frequent in the corpus toward their single most likely gender.
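The window-based intuition can be sketched as a simple voting scheme. This is an assumption-laden illustration rather than the authors' algorithm: it presupposes a small set of seed words of known gender (here Spanish determiners), replaces the article's smoothing with a hard majority vote, ignores sentence boundaries, and uses the hypothetical names `induce_gender`, `window` and `min_count`.

```python
from collections import Counter, defaultdict


def induce_gender(tokens, seed_gender, nouns, window=3, min_count=2):
    """Assign each sufficiently frequent noun the gender that seed words
    within +/- `window` tokens vote for most often."""
    votes = defaultdict(Counter)
    frequency = Counter(tokens)
    for i, token in enumerate(tokens):
        if token not in nouns:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i and tokens[j] in seed_gender:
                votes[token][seed_gender[tokens[j]]] += 1
    # Force each frequent noun toward its single most likely gender.
    return {
        noun: counts.most_common(1)[0][0]
        for noun, counts in votes.items()
        if frequency[noun] >= min_count
    }


# Toy data, invented for illustration.
text = "la casa blanca es bonita . el libro rojo es bueno . una casa vieja . un libro nuevo"
seeds = {"la": "F", "una": "F", "el": "M", "un": "M"}
genders = induce_gender(text.split(), seeds, nouns={"casa", "libro"})
print(genders)
```

In the toy text one masculine determiner falls inside the window of a feminine noun, but the majority vote still recovers the expected gender, which is the effect that forcing nouns toward a single gender is meant to achieve.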

The other agreement features are induced in a similar fashion (but the details are omitted in the article).

The accuracy of the model on the fine-grained POS space (up to five features) is 75.5%. For nouns, the authors distinguish number, gender, definiteness, and case; for verbs, tense, number, and person; and for adjectives, gender and number.

Like Cucerzan and Yarowsky (2002), the present work uses a basic library reference grammar book and an existing monolingual text corpus in the target language. Cucerzan and Yarowsky, however, also use a medium-sized bilingual dictionary. The present work instead uses a paradigm-based morphology that includes only the basic paradigms from a standard grammar textbook (see Chapters 6 and 7).
