Although the forms and distributions of source and target words are not the same, they are not completely unrelated either. As any Spanish speaker would agree, knowledge of Spanish words is useful when trying to understand a text in Portuguese or Catalan. Likewise, any Czech speaker would agree that knowledge of Czech words is helpful in understanding Russian texts. Many of the corresponding Czech and Russian (or Spanish and Portuguese, or Spanish and Catalan) words are cognates (i.e., historically they descend from the same ancestor root or they are mere translations). Here, a cognate pair is defined as a translation pair in which the words from the two languages share both meaning and a similar surface form. Depending on how closely two languages are related, they may share more or fewer cognate pairs. Linguistic intuition suggests that information about cognate words in the source language should help in tagging the target language. Two hypotheses about cognates are tested in the experiments:
1. Cognate pairs have similar morphological and distributional properties.
2. Cognate pairs are similar in form and this tendency is strong enough to be useful.
Obviously, both of these assumptions are approximations, because:
1. Cognates could have diverged in their meaning, and thus probably have different distributions. For example, consider Spanish embarazada ‘pregnant’ vs. Portuguese embaraçada ‘embarrassed’; Czech život (masc.) ‘life’ vs. Russian život (masc.) ‘belly’; Russian krasnyj (adj.) ‘red’ vs. Czech krásný (adj.) ‘nice’; and Catalan cama ‘leg’ vs. Spanish cama ‘bed’.
2. Cognates could have diverged in their morphological properties. For example, the Spanish cerca ‘near’ is an adverb, whereas the Portuguese cerca ‘fence’ is a noun. Both are derived from the Latin circa, circus ‘circle’. In another example, tema ‘theme’, borrowed from Greek, is feminine in Russian and neuter in Czech.
3. There are false cognates³: unrelated, but similar or even identical words. For example, the Spanish adjective salada means ‘salty’, whereas the Portuguese noun salada means ‘salad’; the Spanish numeral doce means ‘twelve’, whereas the Portuguese noun doce means ‘candy’.

³ It is interesting that even unrelated languages show striking coincidences. For example, the Russian gora ‘mountain/hill’ and the Konda goro ‘mountain/hill’ do not seem to be related; likewise, the Czech mladá ‘young’ is a false cognate of the Arabic malad ‘youth’, yet, coincidentally, the words have similar meanings. This is certainly not a frequent phenomenon, but even though such words are not etymologically related, finding such pairs should not hurt the performance of the system.
Nevertheless, the assumption here is that these examples are genuine exceptions to the rule and that, in the majority of cases, cognates will look and behave similarly. The borrowings, counter-borrowings, and parallel developments of both Romance and Slavic languages have been extensively studied, but this dissertation does not provide a survey of this research.
Feldman et al. (2005) report the results of an experiment in which the 200 most frequent nouns from the Russian development corpus are manually translated into Czech. These nouns account for about 60% of all noun tokens in the development corpus. The information about the distribution of the Czech translations is transferred into the Russian model using an algorithm similar to the one outlined in Section 7.6.2. The performance of the tagger that uses manual translations of these nouns improves by 10% on nouns and by 3.5% overall.
The error analysis reveals that many frequent Russian nouns are disambiguated by the word order information alone, and that some Czech-Russian translations do not correspond well in their morphological properties and therefore create additional errors in the transfer process. Even though the increase in the tagger’s performance is not as high as one would expect, the results of this baseline experiment are rather promising. The following sections describe similar experiments in which cognates are detected automatically.
7.6.1 Cognate detection
As discussed in Chapter 2 and earlier in this chapter, languages are often close enough to others within their language family that cognate pairs between them are common, and significant portions of a bilingual lexicon can be induced with high accuracy.
The approach to cognate detection does not assume access to philological erudition, to accurate Czech-Russian/Spanish-Portuguese/Spanish-Catalan translations, or even to a sentence-aligned corpus. None of these resources would be obtainable in a resource-poor setting. In the absence of this knowledge, cognates are identified automatically using the edit distance measure (Levenshtein 1966). Because edit distance is affected by the number of arguments (characters) it needs to consider, the edit distance measure is normalized by word length.
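A length-normalized edit distance of this kind can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in the experiments: the vowel set, the cost values (0.5 and 1.0), the choice to normalize by the longer word, and the function names are all hypothetical. The substitution cost is left as a parameter because the costs used here depend on the characters involved, as the next paragraph explains.

```python
# Assumption: a toy vowel inventory for Latin-alphabet input.
VOWELS = set("aeiouy")


def sub_cost(a, b):
    """Hypothetical substitution cost: vowel-for-vowel is cheaper."""
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return 0.5  # illustrative value, not taken from the source
    return 1.0


def edit_distance(s, t, sub=sub_cost, indel=1.0):
    """Weighted Levenshtein distance computed row by row."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [i * indel]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + indel,           # delete a
                           cur[j - 1] + indel,        # insert b
                           prev[j - 1] + sub(a, b)))  # substitute or match
        prev = cur
    return prev[-1]


def normalized_distance(s, t, sub=sub_cost):
    """Edit distance divided by the length of the longer word (an assumption)."""
    if not s and not t:
        return 0.0
    return edit_distance(s, t, sub) / max(len(s), len(t))
```

With these illustrative costs, a pair that differs only in a vowel scores lower (is more cognate-like) than a pair that differs in a consonant. Language-specific regularities could be added by passing a different `sub` function.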
Unlike standard edit distance, the cost of operations here depends on the arguments. As in Yarowsky and Wicentowski (2000), it is assumed that in any language, vowels are more mutable in inflection than consonants. Thus, for example, replacing a with i is cheaper than replacing s with r. In addition, costs are refined based on some well-known and common language-specific phonetic-orthographic regularities (e.g., replacing q with c is less costly than replacing m with, say, s in Spanish and Portuguese; the same holds for g and h in Russian and Czech). However, performing a detailed contrastive morpho-phonological analysis is undesirable, since portability to other languages is a crucial feature of the system. Some facts from a simple grammar reference book should therefore be enough.⁴

⁴ For Catalan, only the standard edit distance with the operation costs refined for vowels, as in Yarowsky and Wicentowski (2000), is actually used. Phonetic-orthographic regularities were not incorporated. This was done to determine how crucial knowledge of the phonetic-orthographic regularities is for improving the overall performance of the system.

7.6.2 Cognate transfer

Edit distance, as described in Section 7.6.1, was used to obtain a list of Czech-Russian, Spanish-Portuguese, and Spanish-Catalan cognate pairs. These are used to map the emission probabilities acquired on the source language to the target language. To explain this further, assume that $w_{source}$ and $w_{target}$ are cognate words. Let $T_{source}$ denote the tags that $w_{source}$ occurs with in $Corpus_{source}$. Let $p_{source}(t)$ be the emission probability of tag $t$ ($t \notin T_{source} \Rightarrow p_{source}(t) = 0$). Let $T_{target}$ denote the tags assigned to $w_{target}$ by the morphological analyzer, and let $p_{target}(t)$ be the even emission probability: $p_{target}(t) = \frac{1}{|T_{target}|}$. Then the new emission probability $p'_{target}(t)$ is assigned to every tag $t \in T_{target}$ as given in (7.7) (followed by normalization):

(7.7) $p'_{target}(t) = \frac{p_{source}(t) + p_{target}(t)}{2}$
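The transfer step in (7.7) can be sketched as follows, reading the update as the average of the source emission probability and the even target probability, followed by normalization over the target tags. This is a minimal illustration, not the system's actual code: the function name, the dictionary representation, and the tag labels in the usage note are hypothetical.

```python
def transfer_emissions(p_source, target_tags):
    """Combine source emission probabilities with an even target distribution.

    p_source    -- dict mapping tag -> emission probability for the source
                   cognate; tags missing from the dict count as probability 0
    target_tags -- tags the morphological analyzer assigns to the target word
    """
    target_tags = list(target_tags)
    if not target_tags:
        return {}
    even = 1.0 / len(target_tags)  # p_target(t) = 1 / |T_target|
    # Equation (7.7): average of the source and the even target probability.
    raw = {t: (p_source.get(t, 0.0) + even) / 2 for t in target_tags}
    # Normalization: needed when some source tags fall outside T_target.
    total = sum(raw.values())
    return {t: v / total for t, v in raw.items()}
```

For instance, with the made-up tags `A`, `B`, and `X`, calling `transfer_emissions({"A": 0.5, "X": 0.5}, ["A", "B"])` gives `A` twice the probability of `B`: the source evidence shifts mass toward the shared tag, while `B` keeps a non-zero share from the even distribution.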
As usual, the performance of the system is summarized on the test corpora across all categories, and detailed evaluations are reported for three major parts of speech: nouns, adjectives, and verbs. These reports are in Tables 7.28–7.35. For comparison, the various other models tested to this point are included in the tables as well, all with evenly distributed emission probabilities. For Russian, the performance improves by 5.1% overall (full tag: 83.7 vs. 78.6 for the first model in Table 7.28), by 20.5% for nouns, by 6.8% for verbs, and by 12.7% for adjectives. For Catalan, the performance improves by 4.5% overall, by 10.8% for nouns, by 4.2% for verbs, and by 12.1% for adjectives. For Portuguese, the performance improves by 4.9% overall, by 9.9% for nouns, by 4.0% for verbs, and by 9.8% for adjectives.

Target: Russian
Transitions  Czech   CzechRu  Polish  Interlingua  Hybrid  CzechRu
Emissions    evenRu  evenRu   evenRu  evenRu       evenRu  RuCognates
Full tag     78.6    81.4     74.6    80.5         79.8    83.7
POS          92.7    91.4     91.9    91.6         92.9    91.4
SubPOS       90.9    89.2     90.2    89.4         91.3    89.2
Gender       91.1    88.7     88.4    90.6         90.3    91.9
Number       94.0    94.5     92.7    94.0         94.0    93.7
Case         87.6    88.9     81.2    86.7         86.1    89.2
Table 7.28: Evaluation of Russian tagging of all categories with various parameters.
Target: Russian
Transitions  Czech   CzechRu  Polish  Interlingua  Hybrid  CzechRu
Emissions    evenRu  evenRu   evenRu  evenRu       evenRu  RuCognates
Full tag     65.8    77.1     57.2    63.0         68.9    86.3
POS          94.5    92.3     93.7    92.3         94.3    92.3
SubPOS       94.5    92.3     93.7    92.3         94.3    92.2
Gender       83.5    79.8     79.3    79.7         84.9    92.2
Number       90.1    94.1     89.3    90.8         92.1    90.9
Case         76.9    87.1     66.9    72.7         76.6    88.2
Table 7.29: Evaluation of Russian tagging of nouns with various parameters.
Target: Russian
Transitions  Czech   CzechRu  Polish  Interlingua  Hybrid  CzechRu
Emissions    evenRu  evenRu   evenRu  evenRu       evenRu  RuCognates
Full tag     53.0    63.5     51.8    67.3         61.4    65.7
POS          80.8    81.2     84.3    81.2         86.3    81.2
SubPOS       71.5    67.3     76.1    67.3         79.7    67.3
Gender       89.4    88.3     87.1    91.4         87.8    90.6
Number       93.4    96.2     93.7    97.2         94.7    96.2
Case         75.5    83.5     65.7    85.8         77.2    84.3
Table 7.30: Evaluation of Russian tagging of adjectives with various parameters.
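As a quick arithmetic check, the Russian gains quoted in the text appear to correspond to the difference between the first and the last column of the full-tag rows in Tables 7.28–7.30. The sketch below hard-codes those six values; the dictionary and function names are hypothetical, and the verb figure is omitted because its table is not shown here.

```python
# Full-tag scores copied from Tables 7.28-7.30 (target: Russian).
# Each pair is (first column: Czech transitions with even emissions,
#               last column: CzechRu transitions with RuCognates emissions).
FULL_TAG = {
    "all": (78.6, 83.7),
    "nouns": (65.8, 86.3),
    "adjectives": (53.0, 65.7),
}


def gains(scores):
    """Difference in points between the cognate model and the first model."""
    return {name: round(after - before, 1)
            for name, (before, after) in scores.items()}


RUSSIAN_GAINS = gains(FULL_TAG)
print(RUSSIAN_GAINS)
```

The resulting differences (5.1, 20.5, and 12.7) agree with the improvements reported for Russian overall, for nouns, and for adjectives.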