A Practical Solution to the Problem of Automatic Part-of-Speech Induction from Text
Reinhard Rapp
University of Mainz, FASK D-76711 Germersheim, Germany rapp@mail.fask.uni-mainz.de
Abstract
The problem of part-of-speech induction from text involves two aspects: Firstly, a set of word classes is to be derived automatically. Secondly, each word of a vocabulary is to be assigned to one or several of these word classes. In this paper we present a method that solves both problems with good accuracy. Our approach adopts a mixture of statistical methods that have been successfully applied in word sense induction. Its main advantage over previous attempts is that it reduces the syntactic space to only the most important dimensions, thereby almost eliminating the otherwise omnipresent problem of data sparseness.
1 Introduction
Whereas most previous statistical work concerning parts of speech has been on tagging, this paper deals with part-of-speech induction. In part-of-speech induction two phases can be distinguished: In the first phase a set of word classes is to be derived automatically on the basis of the distribution of the words in a text corpus. These classes should be in accordance with human intuitions, i.e. common distinctions such as nouns, verbs and adjectives are desirable. In the second phase, based on its observed usage, each word is assigned to one or several of the previously defined classes.

The main reason why part-of-speech induction has received far less attention than part-of-speech tagging is probably that there seemed to be no urgent need for it: linguists have always considered classifying words as one of their core tasks, and as a consequence accurate lexicons providing such information are readily available for many languages. Nevertheless, deriving word classes automatically is an interesting intellectual challenge with relevance to cognitive science. Also, advantages of automatic systems are that they should be more objective and that they can provide precise information on the likelihood distribution for each of a word's parts of speech, an aspect that is useful for statistical machine translation.
The pioneering work on class-based n-gram models by Brown et al. (1992) was motivated by such considerations. In contrast, Schütze (1993), by applying a neural network approach, put the emphasis on the cognitive side. More recent work includes Clark (2003), who combines distributional and morphological information, and Freitag (2004), who uses a hidden Markov model in combination with co-clustering.
Most studies use abstract statistical measures such as perplexity or the F-measure for evaluation. This is good for quantitative comparisons, but makes it difficult to check whether the results agree with human intuitions. In this paper we use a straightforward approach for evaluation. It involves checking whether the automatically generated word classes agree with the word classes known from grammar books, and whether the class assignments for each word are correct.
2 Approach
In principle, word classification can be based on a number of different linguistic principles, e.g. on phonology, morphology, syntax or semantics. However, in this paper we are only interested in syntactically motivated word classes. With syntactic classes the aim is that words belonging to the same class can substitute for one another in a sentence without affecting its grammaticality.

As a consequence of this substitutability, when looking at a corpus, words of the same class typically show high agreement concerning their left and right neighbors. For example, nouns are frequently preceded by words like a, the, or this, and succeeded by words like is, has or in. In statistical terms, words of the same class have a similar frequency distribution concerning their left and right neighbors. To some extent this can also be observed with indirect neighbors, but with them the effect is less salient and therefore we do not consider them here.
The co-occurrence information concerning the words in a vocabulary and their neighbors can be stored in a matrix as shown in Table 1. If we now want to discover word classes, we simply compute the similarities between all pairs of rows using a vector similarity measure such as the cosine coefficient and then cluster the words according to these similarities. The expectation is that unambiguous nouns like breath and meal form one cluster, and that unambiguous verbs like discuss and protect form another cluster.
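To make this construction concrete, the following Python fragment sketches how counts of left and right neighbors can be collected into such a matrix and how rows are compared with the cosine coefficient. It is purely illustrative: the toy corpus, the vocabulary and all identifiers are invented for this sketch and do not reflect the implementation used in our experiments.

```python
from collections import Counter
import numpy as np

# toy data standing in for a real corpus and vocabulary (illustrative only)
corpus = "the breath is deep we discuss the plan you discuss a meal is good".split()
vocab = ["breath", "discuss", "meal"]      # rows: words to be classified
context = sorted(set(corpus))              # columns: possible neighbors

left, right = Counter(), Counter()
for i in range(1, len(corpus) - 1):
    w = corpus[i]
    left[(w, corpus[i - 1])] += 1          # which word precedes w
    right[(w, corpus[i + 1])] += 1         # which word follows w

# one row per word: left-neighbor counts followed by right-neighbor counts
M = np.array([[left[(w, c)] for c in context] +
              [right[(w, c)] for c in context] for w in vocab], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

for i in range(len(vocab)):
    for j in range(i + 1, len(vocab)):
        print(vocab[i], vocab[j], round(cosine(M[i], M[j]), 3))
```

In the actual experiments the same kind of counts are collected from the BNC, and the rows correspond to the 50 test words listed in Table 2 (see Section 3).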
Ambiguous words like link or suit should not form a tight cluster but are placed somewhere in between the noun and the verb clusters, with the exact position depending on the ratios of the occurrence frequencies of their readings as either a noun or a verb. As this ratio can be arbitrary, in our experience ambiguous words do not severely affect the clustering but only form some uniform background noise which more or less cancels out in a large vocabulary.1 Note that the correct assignment of the ambiguous words to clusters is not required at this stage, as this is taken care of in the next step.
This step involves computing the differential vector of each word from the centroid of its closest cluster and assigning this differential vector to the most appropriate other cluster. The process can be repeated until the length of the differential vector falls below a threshold or, alternatively, the agreement with any of the centroids becomes too low. This way an ambiguous word is assigned to several parts of speech, starting from the most common and proceeding to the least common. Figure 1 illustrates this process.
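The reassignment loop just described can be rendered schematically as follows. The sketch assumes that the cluster centroids have already been computed; the function and variable names are invented for illustration, and the similarity threshold of 0.8 anticipates the value used in Section 4.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def assign_classes(word_vec, centroids, threshold=0.8):
    """Assign a word to one or several classes, most frequent reading first.

    `centroids` maps a class label to the centroid vector of its cluster.
    """
    vec = np.asarray(word_vec, dtype=float)
    remaining = dict(centroids)

    # first assignment: the closest cluster gives the dominant part of speech
    label, cen = max(remaining.items(), key=lambda kv: cosine(vec, kv[1]))
    assigned = [label]

    while True:
        vec = vec - cen                    # differential vector
        del remaining[label]
        if not remaining:
            break
        label, cen = max(remaining.items(), key=lambda kv: cosine(vec, kv[1]))
        if cosine(vec, cen) <= threshold:
            break                          # remainder attributed to sampling error
        assigned.append(label)
    return assigned
```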
1 An alternative to relying on this fortunate but somewhat unsatisfactory effect would be not to use global co-occurrence vectors but local ones, as successfully proposed for word sense induction (Rapp, 2004). This means that every occurrence of a word obtains a separate row vector in Table 1. The problem with the resulting extremely sparse matrix is that most vectors are either orthogonal to each other or duplicates of some other vector, with the consequence that the dimensionality reduction that is indispensable for such matrices does not lead to sensible results. This problem is not as severe in word sense induction, where larger context windows are considered.
The procedure that we described so far works in theory but not well in practice. The problem with it is that the matrix is so sparse that sampling errors have a strong negative effect on the results of the vector comparisons. Fortunately, the problem of data sparseness can be minimized by reducing the dimensionality of the matrix. An appropriate algebraic method that has the capability to reduce the dimensionality of a rectangular matrix is Singular Value Decomposition (SVD). It has the property that when reducing the number of columns the similarities between the rows are preserved in the best possible way. Whereas in other studies the reduction has typically been from several ten thousand to a few hundred, our reduction is from several ten thousand to only three. This leads to a very strong generalization effect that proves useful for our particular task.
              left neighbors       right neighbors
            a   we  the  you      a  can   is  well
breath     11    0   18    0      0   14   19     0
discuss     0   17    0   10      9    0    0     8
link       14    6   11    7     10    9   14     3
protect     0   15    1   12     14    0    0     4

Table 1. Co-occurrence matrix of adjacent words.
Figure 1. Constructing the parts of speech for can.
3 Procedure
Our computations are based on the unmodified text of the 100-million-word British National Corpus (BNC), i.e. including all function words and without lemmatization. By counting the occurrence frequencies of pairs of adjacent words we compiled a matrix as exemplified in Table 1. As this matrix is too large to be processed with our algorithms (SVD and clustering), we decided to restrict the number of rows to a vocabulary appropriate for evaluation purposes. Since we are not aware of any standard vocabulary previously used in related work, we manually selected an ad hoc list of 50 words with BNC frequencies between 5,000 and 6,000, as shown in Table 2. The choice of 50 was motivated by the intention to give complete clustering results in graphical form. As we did not want to deal with morphology, we used base forms only. Also, in order to be able to subjectively judge the results, we only selected words where we felt reasonably confident about their possible parts of speech. Note that the list of words was compiled before the start of our experiments and remained unchanged thereafter.
The co-occurrence matrix based on the restricted vocabulary and all neighbors occurring in the BNC has a size of 50 rows times 28,443 columns. As our transformation function we simply use the logarithm after adding one to each value in the matrix.2 As usual, the one is added for smoothing purposes and to avoid problems with zero values. We decided not to use a sophisticated association measure such as the log-likelihood ratio because it has an inappropriate value characteristic that prevents the SVD, which is conducted in the next step, from finding optimal dimensions.3

2 For arbitrary vocabularies the row vectors should be divided by the corpus frequency of the corresponding word.
3 We are currently investigating if replacing the log-likelihood values by their ranks can solve this problem.
The purpose of the SVD is to reduce the number of columns in our matrix to the main dimensions. However, it is not clear how many dimensions should be computed. Since our aim of identifying basic word classes such as nouns or verbs requires strong generalizations instead of subtle distinctions, we decided to take only the three main dimensions into account, i.e. the resulting matrix has a size of 50 rows times 3 columns.4 The last step in our procedure involves applying a clustering algorithm to the 50 words corresponding to the rows of the matrix. We used hierarchical clustering with average linkage, a linkage type that provides considerable tolerance concerning outliers.

4 Note that larger matrices can require a few more dimensions.
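Taken together, the processing chain of this section can be sketched roughly as follows, assuming the raw count matrix M of 50 words times 28,443 neighbor columns has already been built. The sketch relies on NumPy and SciPy and illustrates the steps rather than reproducing the original implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

def reduce_and_cluster(M, dims=3):
    # transformation function: logarithm after adding one to each count
    X = np.log(M + 1.0)
    # SVD; keep only the main dimensions (three in our experiments)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    reduced = U[:, :dims] * S[:dims]
    # hierarchical clustering with average linkage on cosine distances
    Z = linkage(reduced, method="average", metric="cosine")
    return reduced, Z

# Example use (M is the count matrix described above):
#   reduced, Z = reduce_and_cluster(M)
#   dendrogram(Z, labels=vocabulary)   # draws a tree like those in Figure 2
```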
4 Results and Evaluation
Our results are presented as dendrograms which, in contrast to two-dimensional dot-plots, have the advantage of being able to correctly show the true distances between clusters. The two dendrograms in Figure 2 were both computed by applying the procedure described in the previous section, with the only difference that in generating the upper dendrogram the SVD step has been omitted, whereas in generating the lower dendrogram it has been conducted. Without SVD the expected clusters of verbs, nouns and adjectives are not clearly separated, and the adjectives widely and rural are placed outside the adjective cluster. With SVD, all 50 words are in their appropriate clusters and the three discovered clusters are much more salient. Also, widely and rural are well within the adjective cluster. The comparison of the two dendrograms indicates that the SVD was capable of making appropriate generalizations. Also, when we look inside each cluster we can see that ambiguous words like suit, drop or brief are somewhat closer to their secondary class than unambiguous words.
Having obtained the three expected clusters, the next investigation concerns the assignment of the ambiguous words to additional clusters. As described previously, this is done by computing differential vectors and assigning these to the most similar other cluster. For the cosine similarity we set a threshold of 0.8. That is, only if the similarity between the differential vector and its closest centroid was higher than 0.8 did we assign the word to this cluster and continue to compute differential vectors. Otherwise we assumed that the differential vector was caused by sampling errors and aborted the process of searching for additional class assignments.
The results from this procedure are shown in Table 2, where for each of the 50 words all computed classes are given in the order in which they were obtained by the algorithm, i.e. the dominant assignments are listed first. Although our algorithm does not name the classes, for simplicity we interpret them in the obvious way, i.e. as nouns, verbs and adjectives. A comparison with WordNet 2.0 is given in brackets. For example, +N means that WordNet lists the additional assignment noun, and -A indicates that the assignment adjective found by the algorithm is not listed in WordNet.

According to this comparison, for all 50 words the first reading is correct. For 16 words an additional second reading was computed, which is correct in 11 cases. 16 of the WordNet assignments are missing, among them the verb readings for reform, suit, and rain and the noun reading for serve. However, as many of these WordNet assignments seem rare, it is not clear to what extent the omissions can be attributed to shortcomings of the algorithm.
accident   N           expensive      A           reform     N (+V)
belief     N           familiar       A (+N)      rural      A
birth      N (+V)      finance        N V         screen     N (+V)
breath     N           grow           V N (-N)    seek       V (+N)
brief      A N         imagine        V           serve      V (+N)
broad      A (+N)      introduction   N           slow       A V
busy       A V         link           N V         spring     N A V (-A)
catch      V N         lovely         A (+N)      strike     N V
critical   A           lunch          N (+V)      suit       N (+V)
cup        N (+V)      maintain       V           surprise   N V
dangerous  A           occur          V N (-N)    tape       N V
discuss    V           option         N           thank      V A (-A)
drop       V N         pleasure       N           thin       A (+V)
drug       N (+V)      protect        V           tiny       A
empty      A V (+N)    prove          V           widely     A N (-N)
encourage  V           quick          A (+N)      wild       A (+N)
establish  V           rain           N (+V)

Table 2. Computed parts of speech for each word.
5 Summary and Conclusions
This work was inspired by previous work on word sense induction. The results indicate that part-of-speech induction based on the analysis of distributional patterns in text is possible with good success. The study also gives some insight into how SVD is capable of significantly improving the results.

Whereas in a previous paper (Rapp, 2004) we found that for word sense induction the local clustering of local vectors is more appropriate than the global clustering of global vectors, for part-of-speech induction our conclusion is that the situation is exactly the other way round, i.e. the global clustering of global vectors is more adequate (see footnote 1). This finding is of interest when trying to understand the nature of syntax versus semantics when expressed in statistical terms.
Acknowledgements
I would like to thank Manfred Wettler and Christian Biemann for comments, Hinrich Schütze for the SVD software, and the DFG (German Research Society) for financial support.
References
Brown, Peter F.; Della Pietra, Vincent J.; deSouza, Peter V.; Lai, Jennifer C.; Mercer, Robert L. (1992). Class-based n-gram models of natural language. Computational Linguistics 18(4), 467-479.

Clark, Alexander (2003). Combining distributional and morphological information for part of speech induction. Proceedings of 10th EACL, Budapest, 59-66.

Freitag, Dayne (2004). Toward unsupervised whole-corpus tagging. Proceedings of COLING, Geneva, 357-363.

Rapp, Reinhard (2004). A practical solution to the problem of automatic word sense induction. Proceedings of ACL (Companion Volume), Barcelona, 195-198.

Schütze, Hinrich (1993). Part-of-speech induction from scratch. Proceedings of ACL, Columbus, 251-258.
Figure 2. Syntactic similarities with (lower dendrogram) and without SVD (upper dendrogram).