Data Analysis, Machine Learning and Applications – Episode 3, Part 6



Quantitative Text Analysis Using L-, F- and T-Segments

Table 1 Text numbers in the corpus with respect to genre and author (columns: Brentano, Goethe, Rilke, Schnitzler; table body not recoverable)

3 Distribution of segment types

Starting from the hypothesis that L-, F- and T-segments are not only units which are easily defined and easy to determine, but also possess a certain psychological reality, i.e. that they play a role in the process of text generation, it seems plausible to assume that these units display a lawful distributional behaviour similar to well-known linguistic units such as words or syntactic constructions (cf. Köhler (1999)).

A first confirmation - however, on data from only a single Russian text - was found in Köhler (2007). A corresponding test on the data of the present study corroborates the hypothesis. Each of the 66 texts shows a rank-frequency distribution of the 3 kinds of segment patterns according to the Zipf-Mandelbrot distribution, which was fitted to the data in its standard form:

f(r) = C / (b + r)^a,   r = 1, 2, ...

Fig 1 Rank-Frequency Distribution of L-Segments

Figure 1 shows the fit of this distribution to the data of one of the texts on the basis of L-segments on a log-log scale. In this case, the goodness-of-fit test yielded P(χ²) ≈


Reinhard Köhler and Sven Naumann

1.0 with 92 degrees of freedom. N = 941 L-segments were found in the text, forming x_max = 112 different patterns. Similar results were obtained for all three kinds of segments and all texts. Various experiments with the frequency distributions show promising differences between authors and genres. However, these differences alone do not yet allow for a crisp discrimination.
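A fit of this kind can be sketched as follows. This is an illustrative Python reimplementation, not the authors' software; the toy rank-frequency data and the grid-search ranges are invented for the example:

```python
def zipf_mandelbrot(r, a, b, c):
    # Zipf-Mandelbrot rank-frequency law: f(r) = c / (b + r)^a
    return c / (b + r) ** a

# Toy rank-frequency data for segment patterns (rank 1 = most frequent pattern),
# generated from known parameters so the fit can be checked.
ranks = list(range(1, 21))
freqs = [zipf_mandelbrot(r, 1.1, 2.0, 300.0) for r in ranks]

def fit_zipf_mandelbrot(ranks, freqs):
    """Crude grid search over (a, b); c is fixed so the total frequency is matched."""
    total = sum(freqs)
    best = None
    for a in [0.8 + 0.05 * i for i in range(21)]:
        for b in [0.5 * j for j in range(1, 11)]:
            shape = [1.0 / (b + r) ** a for r in ranks]
            c = total / sum(shape)
            err = sum((c * s - f) ** 2 for s, f in zip(shape, freqs))
            if best is None or err < best[0]:
                best = (err, a, b, c)
    return best[1:]

a, b, c = fit_zipf_mandelbrot(ranks, freqs)
```

In practice one would fit by maximum likelihood or chi-square minimization and then apply the goodness-of-fit test mentioned above; the grid search merely keeps the sketch dependency-free.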

4 Length distribution of L-segments

As a consequence of our general hypothesis, not only the segment types but also the length of the segments should follow lawful patterns. Here, we study the distribution of L-segment length. First, a theoretical model is set up on the basis of three plausible assumptions:

1. There is a tendency in natural language to form compact expressions. This can be achieved at the cost of more complex constituents on the next level. An example is the following: the phrase "as a consequence" consists of 3 words, where the word "consequence" has 3 syllables. The same idea can be expressed using the shorter expression "consequently", which consists of only 1 word of 4 syllables. Hence, more compact expressions on one level go along with more complex expressions on the next level. Here, the consequence of the formation of longer words is relevant. The variable K will represent this tendency.

2. There is an opposed tendency, viz. word length minimization. It is a consequence of the same tendency of effort minimization which is responsible for the first tendency, but now considered on the word level. We will denote this requirement by M.

3. The mean word length in a language can be considered as constant, at least for a certain period of time. This constant will be represented by q.

According to a general approach proposed by Altmann (cf. Altmann and Köhler (1996)) and substituting k = K − 1 and m = M − 1, the following equation can be set up, which yields the hyper-Pascal distribution:

P_x = [ C(k + x − 1, x) / C(m + x − 1, x) ] · q^x · P_0,   x = 0, 1, 2, ...   (3)

with P_0^(−1) = ₂F₁(k, 1; m; q) - the hypergeometric function - as norming constant.

Here, (3) is used in a 1-displaced form because length 0 is not defined, i.e. segments consisting of 0 words are impossible. As this model is not likely to be adequate also for F- and T-segments - the requirements concerning the basic properties frequency and polytextuality do not imply interactions between adjacent levels - a simpler one can be set up. Due to length limitations on our contribution in this volume we will not describe the appropriate model for these segment types, but it can be said here that their length distributions can be modeled and explained by the hyper-Poisson distribution.
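Assuming the hyper-Pascal form given in (3), the 1-displaced probabilities can be computed numerically from the term ratio P_{x+1}/P_x = q(k + x)/(m + x); a minimal sketch, with parameter values invented for illustration:

```python
def hyper_pascal_1displaced(k, m, q, max_len=200):
    """1-displaced hyper-Pascal pmf over segment lengths 1, 2, ..., max_len.
    Built from the recurrence P_{x+1} = P_x * q * (k + x) / (m + x),
    then normalized numerically (the norming constant is 1/2F1(k,1;m;q))."""
    probs = [1.0]  # unnormalized P_0, displaced to length 1
    for x in range(max_len - 1):
        probs.append(probs[-1] * q * (k + x) / (m + x))
    total = sum(probs)
    return [p / total for p in probs]  # probs[i] = P(length = i + 1)

pmf = hyper_pascal_1displaced(k=1.5, m=2.5, q=0.6)
mean_length = sum((i + 1) * p for i, p in enumerate(pmf))
```

Fitting k, m, q to observed L-segment length frequencies would then proceed by the usual chi-square or likelihood criteria.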

Fig 2 Theoretical and empirical distribution of L-segments in a poem

Fig 3 Theoretical and empirical distribution of L-segments in a short story

The empirical tests on the data from the 66 texts support our hypothesis with good and very good χ² values. Figures 2 and 3 show typical graphs of the theoretical and empirical distributions as modeled using the hyper-Pascal distribution. Figure 2 is an example of poetry; Figure 3 shows a narrative text. Good indicators of text genre or authors could not yet be found on the basis of these distributions. However, only a few of the available characteristics have been considered so far. The same is true of the corresponding experiments with F- and T-segments.


5 TTR studies

Another hypothesis investigated in our study is the assumption that the dynamic behavior of the segments with respect to the increase of types in the course of the given text, the so-called TTR, is analogous to that of words or other linguistic units. Word TTR has the longest history; the large number of approaches presented in linguistics is described and discussed in Altmann (1988, p. 85-90), who also gives a theoretical derivation of the so-called Herdan model, the most commonly used one in linguistics:

y = a x^b e^{cx},   c < 0.   (5)

The value of y(1) can be assumed to be equal to unity, because the first segment of a text must be the first type, of course. Therefore, we can remove the parameter a from the model (setting a = e^{−c}) and simplify (5) as shown in (6):

y = e^{−c} x^b e^{cx} = x^b e^{c(x−1)},   c < 0.   (6)

Figures 4 and 5 show the excellent fits of this model to data from one of the poems and one of the prose texts. Goodness-of-fit was determined using the determination coefficient R², which was above 0.99 in all 66 cases.

Fig 4 L-segment TTR of a poem

Fig 5 L-segment TTR of a short story

The parameters b and c of the TTR model turned out to be quite promising characteristics of text genre and author. They are not likely to discriminate these factors sufficiently when taken alone, but seem to carry a remarkable amount of information. Figure 6 shows the relationship between the parameters b and c.
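Since (6) becomes linear in b and c after taking logarithms (ln y = b ln x + c(x − 1)), it can be fitted by ordinary least squares; a self-contained sketch on invented TTR data:

```python
import math

def ttr_model(x, b, c):
    # equation (6): y = x^b * exp(c * (x - 1)), with c < 0
    return x ** b * math.exp(c * (x - 1))

def fit_ttr(xs, ys):
    """Least-squares fit of ln y = b*ln x + c*(x - 1), which is linear in b and c."""
    u = [math.log(x) for x in xs]      # ln x
    v = [x - 1.0 for x in xs]          # x - 1
    w = [math.log(y) for y in ys]      # ln y
    suu = sum(a * a for a in u)
    svv = sum(a * a for a in v)
    suv = sum(p * q for p, q in zip(u, v))
    suw = sum(p * q for p, q in zip(u, w))
    svw = sum(p * q for p, q in zip(v, w))
    det = suu * svv - suv * suv        # solve the 2x2 normal equations
    b = (suw * svv - svw * suv) / det
    c = (svw * suu - suw * suv) / det
    return b, c

xs = list(range(1, 101))                      # number of segments processed
ys = [ttr_model(x, 0.8, -0.002) for x in xs]  # toy type counts from known parameters
b, c = fit_ttr(xs, ys)

# determination coefficient R^2, as used for goodness-of-fit in the study
y_bar = sum(ys) / len(ys)
r2 = 1 - sum((ttr_model(x, b, c) - y) ** 2 for x, y in zip(xs, ys)) \
        / sum((y - y_bar) ** 2 for y in ys)
```

On real TTR curves the fitted (b, c) pairs would then be collected per text, as in Figure 6.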


Fig 6 Relationship between the values of b and c in the corpus

6 Conclusion

Our study has shown that L-, F- and T-segments on the word level display a lawful behavior in all aspects investigated so far, and that some of the parameters, in particular those of the TTR, seem promising for text classification. Further investigations on more languages and on more text genres will give more reliable answers to these questions.

References

ALTMANN, G. and KÖHLER, R. (1996): "Language Forces" and Synergetic Modelling of Language Phenomena. In: P. Schmidt [Ed.]: Glottometrika 15. Issues in General Linguistic Theory and The Theory of Word Length. WVT, Trier, 62-76.

ANDERSEN, S. (2005): Word length balance in texts: Proportion constancy and word-chain lengths in Proust's longest sentence. Glottometrics 11, 32-50.

BORODA, M. (1982): Häufigkeitsstrukturen musikalischer Texte. In: J. Orlov, M. Boroda, G. Moisei and I. Nadarejšvili [Eds.]: Sprache, Text, Kunst. Quantitative Analysen.

KÖHLER, R. and G. ALTMANN (2000): Probability Distributions of Syntactic Units and Properties. Journal of Quantitative Linguistics 7/3, 189-200.

KÖHLER, R. (2006a): The frequency distribution of the lengths of length sequences. In: J. Genzor and M. Bucková [Eds.]: Favete linguis. Studies in honour of Victor Krupa. Slovak Academic Press, Bratislava, 145-152.

KÖHLER, R. (2006b): Word length in text. A study in the syntagmatic dimension. To appear.

UHLÍŘOVÁ, L. (2007): Word frequency and position in sentence. To appear.

WIMMER, G. and ALTMANN, G. (1999): Thesaurus of Univariate Discrete Probability Distributions. Stamm, Essen.

Structural Differentiae of Text Types – A Quantitative Model

Abstract. The categorization of natural language texts is a well established research field in computational and quantitative linguistics (Joachims 2002). In the majority of cases, the vector space model is used in terms of a bag-of-words approach. That is, lexical features are extracted from input texts in order to train some categorization model and, thus, to attribute, for example, authorship or topic categories. Parallel to these approaches there has been some effort in performing text categorization not in terms of lexical, but of structural features of document structure. More specifically, quantitative text characteristics have been computed in order to derive a sort of structural text signature which nevertheless allows reliable text categorizations (Kelih & Grzybek 2005; Pieper 1975). This "bag of features" approach regains attention when it comes to categorizing websites and other document types whose structure is far away from the simplicity of tree-like structures. Here we present a novel approach to structural classifiers which systematically computes structural signatures of documents. In summary, we present a text categorization algorithm which in the absence of any lexical features nevertheless performs a remarkably good classification even if the classes are thematically defined.

1 Introduction

An alternative way to categorize documents apart from the well established "bag of words" approach is to categorize by means of structural features. This approach functions in absence of any lexical information, utilizing quantitative characteristics of documents computed from the logical document structure.¹ That means that markers like content words are completely disregarded. Features like distributions of sections, paragraphs, sentence length etc. are considered instead.

Capturing structural properties to build a classifier assumes that given category separations are reflected by structural differences. According to Biber (1995) we can expect that functional differences correlate with structural and formal representations of text types. This may explain good overall results in terms of F-Measure.²

¹ See also Mehler et al. (2006).
² The harmonic mean of precision and recall is used here to measure the overall success of the classification.


Olga Pustylnikov and Alexander Mehler

However, the F-Measure gives no information about the quality of the investigated categories. That is, no a priori knowledge about the suitability of the categories for representing homogeneous classes and for applying them in machine learning tasks is provided. Since natural language categories, e.g. in the form of web documents or other textual units, do not necessarily arise with a well defined structural representation available, it is important to know how the classifier behaves dealing with such categories. Here, we investigate a large number of existing categories, thematic classes or

rubrics, taken from a 10-years newspaper corpus of the Süddeutsche Zeitung (SZ 2004), where a rubric represents a recurrent part of the newspaper like 'sports', asking more specifically for a maximal subset of all rubrics which gives an F-Measure above a predefined cut-off c ∈ [0,1] (e.g. c = 0.9). We evaluate the classifier in a way allowing to exclude possible drawbacks with respect to:

• the categorization model used (here SVM³ and Cluster Analysis⁴),
• the text representation model used (here the bag-of-features approach), and
• the structural homogeneity of the categories used.

The first point relates to distinguishing supervised and unsupervised learning. That is, we perform both sorts of learning, although we do not systematically evaluate them comparatively with respect to all possible parameters. Rather, we investigate the potential of our features by evaluating them with respect to both scenarios. The representation format (vector representation) is restricted by the model used (e.g. SVM). Thus, we concentrate on the third point and apply an iterative categorization procedure (ICP)⁵ to explore the structural suitability of categories. In summary, our experiments have twofold goals:

1. to study given categories using the ICP in order to filter out structurally inconsistent types and
2. to make judgements about the structural classifier's behavior dealing with categories of different size and quality levels.

2 Category selection

The 10-years corpus of the SZ used in the present study contains 95 different rubrics. The frequency distribution of these rubrics shows an enormous inequality over the whole set (see Figure 1). In order to minimize the calculation effort, we reduce the initial set of 95 rubrics to a smaller subset according to the following criteria.

1. First, we compute the mean z and the standard deviation V for the whole set.

³ Support Vector Machines.
⁴ Supervised vs. unsupervised, respectively.
⁵ See Sec. 4.


Fig 1 Categories/Articles-Distribution of 95 Rubrics of SZ.

2. Second, we pick out all rubrics R with the cardinality |R| (the number of examples within the corpus) lying in the interval

z − V/2 < |R| < z + V/2.

This selection method allows us to specify a window around the mean value of all documents, leaving out the unusual cases.⁶ Thus, the resulting subset of 68 categories is selected.

3 The evaluation procedure
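The rubric-selection window of Sec. 2 can be sketched in a few lines; the rubric names and cardinalities below are invented for illustration (z and V denote the mean and the standard deviation of the rubric sizes):

```python
from statistics import mean, stdev

def select_rubrics(rubric_sizes):
    """Keep the rubrics whose cardinality |R| lies within (z - V/2, z + V/2)."""
    sizes = list(rubric_sizes.values())
    z, V = mean(sizes), stdev(sizes)
    lo, hi = z - V / 2, z + V / 2
    return {name: n for name, n in rubric_sizes.items() if lo < n < hi}

# hypothetical rubric cardinalities (articles per rubric)
sizes = {"politik": 5200, "sport": 4800, "feuilleton": 5100,
         "beilage": 20000, "sonstiges": 40}
kept = select_rubrics(sizes)
```

Both extremes (the very large and the very small rubrics) fall outside the window and are discarded.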

The data representation format for the subset of rubrics uses a vector representation (bag-of-features approach) where each document is represented by a feature vector.⁷ The vectors are calculated as structural signatures of the underlying documents. To avoid drawbacks (see Sec. 1) caused by the evaluation method in use, we compare three different categorization scenarios:

1. Supervised scenario by means of SVM-light⁸,
2. Unsupervised scenario in terms of Cluster Analysis, and
3. Finally, a baseline experiment based on random clustering.

⁶ The method is taken from Bock (1974). Rieger (1989) uses it to identify above-average agglomeration steps in the clustering framework. Gleim et al. (2007) successfully applied the method to develop quality filters for wiki articles.
⁷ See Mehler et al. (2007) for a formalization of this approach.
⁸ Joachims (2002).


Consider an input corpus K and a set of categories C with the number of categories |C| = n. Then we proceed as follows to evaluate our various learning scenarios:

• For the supervised case we train a binary classifier by treating the negative examples of a category Ci ∈ C as K \ [Ci] and the positive examples as a subset [Ci] ⊆ K. The subsets [Ci] are in this experiment pairwise disjunct and we define L = {[Ci] | Ci ∈ C} as a partition of positive and negative examples of Ci. Classification results are obtained in terms of precision and recall. We calculate the F-score for a class Ci in the following way:

F_i = 2 / (1/recall_i + 1/precision_i)

In the next step we compute the weighted mean over all categories of the partition L in order to judge the overall separability of the given text types.

• In the unsupervised case, best clustering results are presented in terms of F-Measure values.
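The F-score computation and the size-weighted mean over a partition can be sketched as follows (the precision/recall values are invented for the example):

```python
def f_score(precision, recall):
    # harmonic mean of precision and recall: F = 2 / (1/recall + 1/precision)
    return 2.0 / (1.0 / recall + 1.0 / precision)

def weighted_mean_f(per_class):
    """per_class: iterable of (n_examples, precision, recall) per category.
    Returns the example-weighted mean F-score over the partition."""
    per_class = list(per_class)
    total = sum(n for n, _, _ in per_class)
    return sum(n * f_score(p, r) for n, p, r in per_class) / total

# hypothetical per-category results: (size, precision, recall)
classes = [(100, 0.95, 0.90), (50, 0.80, 0.70), (25, 0.60, 0.75)]
overall = weighted_mean_f(classes)
```

Weighting by category size prevents a tiny, well-separated rubric from dominating the overall score.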

• Finally, the random baseline is calculated by preserving the original category sizes and by mapping articles randomly to them. Results of random clustering help to check the success of both learning scenarios. Thus, clusterings close to the random baseline indicate either a failure of the cluster algorithm or that the text types cannot be well separated by structure.
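Such a size-preserving random baseline can be sketched by shuffling the gold label sequence (the label names are invented):

```python
import random
from collections import Counter

def random_clustering(labels, seed=0):
    """Assign articles to clusters at random while preserving the original
    category sizes: simply shuffle the label sequence."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled

gold = ["sport"] * 60 + ["politik"] * 30 + ["kultur"] * 10
baseline = random_clustering(gold)

# the size distribution is unchanged; only the assignment is random
agreement = sum(g == b for g, b in zip(gold, baseline)) / len(gold)
```

Evaluating a classifier's F-Measure against such shuffled assignments gives the floor that both learning scenarios must clearly exceed.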

In summary, we check the performance of structural signatures within two learning scenarios – supervised and unsupervised – and compare the results with the random clustering baseline. The next Section describes the Iterative Categorisation Procedure (ICP) used to investigate the structural homogeneity of categories.

4 Exploring the structural homogeneity of text types by means of the Iterative Categorisation Procedure (ICP)

In this Section we return to the question mentioned at the beginning. Given a cut-off c ∈ [0,1] (e.g. c = 0.9), we ask for the maximal subset of rubrics allowing us to achieve an F-Measure value F > c. Decreasing the cut-off c successively, we get a rank ordering of rubrics ranging from the best contributors to the worst ones. The ICP allows us to determine a result set of maximal size n with the maximal internal homogeneity compared to all candidate sets in question. Starting with a given set of input categories to be learned, we proceed as follows:


1. Start: Select a seed category C ∈ A and set A_1 = {C}. The rank r of C equals r(C) = 1. Now repeat:

2. Iteration (i > 1): Let B = A \ A_{i−1}. Select the category C ∈ B which, when added to A_{i−1}, maximizes the F-Measure value among all candidate extensions of A_{i−1} by single categories of B. Set A_i = A_{i−1} ∪ {C} and r(C) = i.

3. Break off: The iteration terminates if either
   i) A \ A_i = ∅, or
   ii) the F-Measure value of A_i is smaller than a predefined cut-off, or
   iii) the F-Measure value of A_i is smaller than the one of the operative baseline.

If none of these stop conditions holds, repeat step (2).
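The ICP is thus a greedy forward selection. A compact sketch follows, where f_measure is any callable scoring a subset of categories; the toy quality scores and the seed choice (best single category, which the paper leaves open) are assumptions of this illustration:

```python
def icp(categories, f_measure, cutoff=0.0, baseline=0.0):
    """Iterative Categorisation Procedure: greedily extend the selected set by
    the category maximizing the F-Measure, stopping when the set is exhausted
    or F drops below the cut-off or the operative baseline."""
    remaining = set(categories)
    seed = max(remaining, key=lambda cat: f_measure({cat}))  # step 1: seed
    ranking = [seed]
    remaining.discard(seed)
    while remaining:                                          # step 2: iterate
        best = max(remaining, key=lambda cat: f_measure(set(ranking) | {cat}))
        f = f_measure(set(ranking) | {best})
        if f < cutoff or f < baseline:                        # step 3: break off
            break
        ranking.append(best)
        remaining.discard(best)
    return ranking  # rank r(C) = position of C in this list, 1-based

# toy per-category qualities standing in for a trained classifier's F-Measure
quality = {"A": 0.95, "B": 0.90, "C": 0.50}

def mean_quality(subset):
    return sum(quality[cat] for cat in subset) / len(subset)

rank = icp(quality, mean_quality, cutoff=0.8)
```

With cutoff = 0.8, category C is rejected because adding it pulls the mean below the threshold, so the ranking stops at ["A", "B"].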

The kind of ranking described here is more informative than the F-Measure value alone. That is, the F-Measure gives global information about the overall separability of categories. The ICP, in contrast, provides additional local information about the weights of single categories with respect to the overall performance. This information allows us to check the suitability of single categories to serve as structural prototypes. Knowledge about the homogeneity of each category provides a deeper insight into the possibilities of our approach.

In the next Section the rankings of the ICP applied to supervised and unsupervised learning are presented and compared with the random clustering baseline. In order to exclude a dependence of the structural approach on one of the learning methods, we also apply the best-of-unsupervised ranking to the supervised scenario and compare the outcomes. That means, we use exactly the same range having performed best in the unsupervised experiment for SVM learning.

5 Results

Table 1 gives an overview of the categories used. From the total number of 95 rubrics, 68 were selected using the selection method described in Section 2; 55 were considered in unsupervised and 16 in supervised experiments. The common subset used in both cases consists of 14 categories.

The Y-axis of Figure 2 represents the F-Measure values and the X-axis the rank order of categories iteratively added to the seed set. The supervised scenario (upper curve) performs best, ranging around the value of 1.0. The values of the unsupervised case decrease more rapidly (the third curve from above). The unsupervised best-of-ranking categorized with the supervised method (second curve from above) lies between the best results of the two methods. The lower curve represents the results of random clustering.

6 Discussion

According to Figure 2 we can see that all F-Measure results lie high above the baseline of random clustering. All the subsets are well separated by their document
