Tagset design and inflected languages

Excerpt from Portable Language Technology: A Resource-Light Approach to Morpho-Syntactic Tagging (pp. 84-88)

There are several criteria to consider when developing a morpho-syntactic tag system for a language. These include:

1. Degree of relevant linguistic details.

2. Tagset size.

3. Uniformity.

Chapter 7 discusses these issues in more detail. The current section summarizes the discussion of the tagsets for the Slavic and the Romance languages, based on the criteria outlined above.

Good tagset design is particularly important for highly inflected languages. If all of the syntactic variations that are realized in the inflectional system were represented in the tagset, there would be a huge number of tags, and it would be practically impossible to implement or train a tagger.
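To give a rough sense of the scale involved, the following minimal sketch counts the tags that a naive cross product of feature values would produce for a single part of speech. The feature inventory is a hypothetical illustration and is not the inventory of any tagset discussed in this section.

```python
from math import prod

# Hypothetical feature inventory for one part of speech (illustrative only).
ADJECTIVE_FEATURES = {
    "gender": ["masc", "fem", "neut"],
    "number": ["sg", "pl"],
    "case": ["nom", "gen", "dat", "acc", "loc", "ins"],
    "degree": ["pos", "comp", "sup"],
}


def count_tags(features):
    """Number of distinct tags if every combination of values gets its own tag."""
    return prod(len(values) for values in features.values())


print(count_tags(ADJECTIVE_FEATURES))  # 3 * 2 * 6 * 3 = 108
```

Each additional category multiplies the count, which is why a tagset that encodes every inflectional distinction grows so quickly.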

Elworthy (1995) distinguished external and internal criteria for tagset design. The external criterion is that the tagset must be capable of making the linguistic (for example, syntactic or morphological) distinctions required in the output corpora. From this point of view, the MULTEXT-East tagsets for various Slavic languages, the PDT tagset for Czech, the IPI-PAN tagset for Polish, and the CLiC-TALP tagsets for Catalan and Spanish all satisfy this criterion to various degrees, though the relatively coarse-grained NILC tagset perhaps does not. As was described above, these systems make different decisions as to which categories should be present in the set, but overall they provide a rather detailed morphological analysis of the language.

The internal criterion is the design criterion of making the tagging as effective as possible. As an example, one of the most common errors made by taggers with the LOB and Brown tagsets (Francis and Kucera 1982) is mistagging a word as a subordinating conjunction (CS) rather than as a preposition (IN), or vice versa. A higher level of syntactic analysis indicating the phrasal structure would be required to predict which tag is correct, and this information is not available to fixed-context taggers. The Penn Treebank (Marcus et al. 1993) therefore uses a single tag for both cases, leaving the resolution, if required, to some other process.

It can be argued that a smaller tagset should translate to improved tagging accuracy, since it puts less of a burden on the tagger to make fine distinctions. In information-theoretic terms, the number of decisions required is smaller, and hence the tagger needs to contribute less information to make the decisions. A smaller tagset may also mean that more words have only one possible tag and so can be handled trivially (assuming a lexicon listing of the tags is available). On the other hand, more detail in the tagset may help the tagger when the properties of two adjacent words give support to the choice of tag for both of them. That is, the transitions between tags contribute the information the tagger needs. For example, if nouns and adjectives that modify these nouns are marked for case, gender, and number, the tagger can effectively model agreement in simple noun phrases by having a higher probability for a singular nominative feminine adjective followed by a singular nominative feminine noun than it does for a singular nominative feminine adjective followed by a plural genitive masculine noun.
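The agreement argument can be illustrated with a small sketch that estimates tag-to-tag transition probabilities from tag bigrams. The tag sequences and the "POS.number.case.gender" tag format are hypothetical toy data, and the maximum-likelihood estimate without smoothing is a simplifying assumption, not a description of any particular tagger.

```python
from collections import Counter

# Toy tag sequences (hypothetical data). Tags are "POS.number.case.gender".
TAGGED_SENTENCES = [
    ["A.sg.nom.f", "N.sg.nom.f", "V"],
    ["A.sg.nom.f", "N.sg.nom.f", "V"],
    ["A.sg.nom.f", "N.pl.gen.m", "V"],
    ["A.pl.gen.m", "N.pl.gen.m", "V"],
]


def transition_probs(sentences):
    """Maximum-likelihood estimates of P(next_tag | prev_tag) from tag bigrams."""
    bigrams = Counter()
    prev_counts = Counter()
    for tags in sentences:
        for prev, nxt in zip(tags, tags[1:]):
            bigrams[(prev, nxt)] += 1
            prev_counts[prev] += 1
    return {pair: n / prev_counts[pair[0]] for pair, n in bigrams.items()}


probs = transition_probs(TAGGED_SENTENCES)
agreeing = probs[("A.sg.nom.f", "N.sg.nom.f")]  # adjective and noun agree
clashing = probs[("A.sg.nom.f", "N.pl.gen.m")]  # adjective and noun disagree
print(agreeing > clashing)
```

With a coarse tagset that collapses both adjectives to A and both nouns to N, the two transitions would be indistinguishable, so this source of evidence would be lost.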

Elworthy (1995) designs an experiment to explore the relationship between tagging accuracy and the nature of the tagset, using corpora in English, French, and Swedish. The experiment addresses the internal design criterion. The aim of the experiment is to determine, crudely, whether a bigger tagset is better than a smaller one, or whether external criteria requiring human intervention should be used to choose the best tagset.

It turns out that a larger tagset generally gives higher accuracy for Swedish, French, and English on texts with no unknown words (with notable exceptions in French, where gender marking was the key factor). For test corpora that include “unknown” words (words not seen during training, for which the HMM tagger hypothesizes all open-class tags), the results are slightly different. For the three test languages, larger tagsets still improve accuracy on the known words, but for unknown words, smaller tagsets give higher accuracy (again, for French, gender marking is the key factor). These results suggest that there is no consistent relationship between the size of the tagset and the tagging accuracy. Elworthy’s general conclusion is that the external criterion should be the one to dominate tagset design.
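The treatment of unknown words described here can be sketched as a lexicon lookup with a fallback. The lexicon, the tag names, and the open-class set below are hypothetical; the sketch only shows the candidate-generation step, not the HMM itself.

```python
# Hypothetical open-class tags and lexicon (illustrative only).
OPEN_CLASS_TAGS = frozenset({"NOUN", "VERB", "ADJ", "ADV"})
LEXICON = {
    "the": frozenset({"DET"}),
    "run": frozenset({"NOUN", "VERB"}),
}


def candidate_tags(word, lexicon=LEXICON, open_class=OPEN_CLASS_TAGS):
    """Tags the tagger must choose among: lexicon entry, else all open-class tags."""
    return lexicon.get(word, open_class)


print(len(candidate_tags("the")))     # known, unambiguous word
print(len(candidate_tags("blorft")))  # unknown word: every open-class tag
```

Under this scheme the ambiguity of an unknown word equals the number of open-class tags, so a more detailed tagset directly increases the number of candidates for every unknown word.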

Elworthy (1995) suggests that what is important is to choose the tagset required for the application, rather than to optimize it for the tagger. The experiments with subtaggers in Chapter 7 are, in a sense, a follow-up to this work and provide further confirmation of the results. An additional comment that can be made here is that a large tagset can always be reduced to a smaller and less detailed one if the application demands it.
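Reducing a detailed tagset to a coarser one is a simple many-to-one mapping. The sketch below assumes a hypothetical four-position positional tag (POS, gender, number, case); the layout and the example tags are illustrative and do not reproduce any of the tagsets discussed in this thesis.

```python
def reduce_tag(tag, keep):
    """Keep the characters at the given positions; blank out the rest with '-'."""
    return "".join(ch if i in keep else "-" for i, ch in enumerate(tag))


# Hypothetical layout: position 0 = POS, 1 = gender, 2 = number, 3 = case.
fine_tags = ["NFS1", "NFS4", "NMP2", "AFS1"]

pos_only = {reduce_tag(t, keep={0}) for t in fine_tags}
pos_and_case = {reduce_tag(t, keep={0, 3}) for t in fine_tags}

print(sorted(pos_only))      # two coarse tags remain
print(sorted(pos_and_case))  # four tags: case distinctions are preserved
```

The mapping only works in one direction: the detailed tag determines the coarse tag, but the detailed tag cannot be recovered from the coarse one.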

A third issue concerns the question of whether tagsets should be harmonized across languages within the same family or languages that have similar properties. The efforts of the MULTEXT-East project have already been discussed. Additionally, a MULTEXT-East-style tagset for Russian was constructed at the University of Tübingen (http://www.sfb441.uni-tuebingen.de/c1/tagset.html). These tagsets are based on a common repertoire of grammatical classes (POSs) and grammatical categories (e.g. case, person, gender), and each tagset uses a subset of those grammatical classes or categories.

The main goal of standard morpho-syntactic specifications is to make it easier to develop multilingual applications or to evaluate language technology tools across several languages. The process of standardization is interesting from a language-typological perspective as well.

Przepiórkowski and Woliński (2003) note certain weaknesses in the standardization approach. The relative uniformity of the POS classes across the 9 languages of the MULTEXT-East project is attained at the cost of introducing the grammatical category ‘type’, whose values reflect the considerable differences between the POS systems of the languages involved. In addition, it is not clear that the various grammatical categories and their values have the same interpretation in each language. The same point can be made about the tagset used in this thesis. Russian and Czech, though very similar, differ significantly in many respects, and the tagset that was used for Czech is not necessarily optimal for Russian. For example, Czech distinguishes between the masculine inanimate and masculine animate genders, whereas Russian does not. Czech uses clitics for verb reflexivization (which, of course, have a special tag in the Czech tag system), whereas Russian reflexivization is done by reflexive suffixes. Should a special tag for Russian reflexive verbs be introduced?

As for the Romance languages, the fact that the Spanish and the Catalan CLiC-TALP tagsets were standardized allowed a quick and efficient comparison of the properties of the languages and adaptation of the current system for these particular languages. Since such detailed morphological tagsets had not been developed for Russian and Portuguese, it was decided to use the tag formalisms developed for the source languages, i.e. for Czech and Spanish, respectively.

In general, however, in order for a tagset to be reusable and comparable with similar tagsets for related languages, it must be based on a homogeneous set of clear formal, morphological, and morpho-syntactic criteria. Only once such criteria for delimiting grammatical classes and categories are presented in detail can those classes and categories be mapped to the grammatical classes and categories of other similarly constructed tagsets.

This thesis, however, does not claim that the tagsets used for the experiments described in Chapters 6 and 7 are the most adequate linguistic descriptions of the Slavic and Romance languages.
