Table 1. Text numbers in the corpus with respect to genre and author (authors: Brentano, Goethe, Rilke, Schnitzler)
3 Distribution of segment types
Starting from the hypothesis that L-, F- and T-segments are not only units which are easily defined and easy to determine, but also possess a certain psychological reality, i.e. that they play a role in the process of text generation, it seems plausible to assume that these units display a lawful distributional behaviour similar to well-known linguistic units such as words or syntactic constructions (cf. Köhler (1999)). A first confirmation, albeit on data from only a single Russian text, was found in Köhler (2007). A corresponding test on the data of the present study corroborates the hypothesis. Each of the 66 texts shows a rank-frequency distribution of the three kinds of segment patterns according to the Zipf-Mandelbrot distribution, which was fitted to the data in the following form:
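One standard right-truncated form of the Zipf-Mandelbrot distribution, in the parametrization commonly used in quantitative linguistics (the exact parametrization fitted here may differ in detail), is

P_x = (x + b)^(−a) / T(n),  x = 1, 2, …, n,  with T(n) = Σ_{j=1}^{n} (j + b)^(−a),

where x is the rank of a segment pattern, n = x_max is the number of different patterns, and a and b are the parameters to be fitted.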
Figure 1 shows the fit of this distribution to the data of one of the texts on the basis of L-segments on a log-log scale. In this case, the goodness-of-fit test yielded P(χ²) ≈ 1.0 with 92 degrees of freedom. N = 941 L-segments were found in the text, forming x_max = 112 different patterns. Similar results were obtained for all three kinds of segments and all texts. Various experiments with the frequency distributions show promising differences between authors and genres. However, these differences alone do not yet allow for a crisp discrimination.

Fig. 1. Rank-frequency distribution of L-segments
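A minimal sketch of how such a fit and the reported goodness-of-fit value can be computed; the optimizer, the starting values and the function names are illustrative assumptions, not the procedure actually used in the study.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def zipf_mandelbrot_probs(a, b, n):
    """Right-truncated Zipf-Mandelbrot probabilities for ranks 1..n."""
    ranks = np.arange(1, n + 1)
    weights = (ranks + b) ** (-a)
    return weights / weights.sum()

def fit_zipf_mandelbrot(freqs):
    """Fit (a, b) by minimizing the chi-square statistic.
    freqs[r-1] is the observed frequency of the pattern with rank r."""
    freqs = np.asarray(freqs, dtype=float)
    N, n = freqs.sum(), len(freqs)

    def chisq(params):
        a, b = params
        expected = N * zipf_mandelbrot_probs(a, b, n)
        return np.sum((freqs - expected) ** 2 / expected)

    res = minimize(chisq, x0=[1.0, 1.0], bounds=[(0.01, 10.0), (0.0, 100.0)])
    a, b = res.x
    x2 = res.fun
    # Simple dof: number of classes minus 1 minus fitted parameters;
    # the study may additionally pool low-frequency classes.
    dof = n - 1 - 2
    return a, b, x2, dof, chi2.sf(x2, dof)
```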
4 Length distribution of L-segments
As a consequence of our general hypothesis, not only the segment types but also the lengths of the segments should follow lawful patterns. Here, we study the distribution of L-segment length. First, a theoretical model is set up on the basis of three plausible assumptions:
1. There is a tendency in natural language to form compact expressions. This can be achieved at the cost of more complex constituents on the next level. An example is the following: the phrase "as a consequence" consists of 3 words, where the word "consequence" has 3 syllables. The same idea can be expressed using the shorter expression "consequently", which consists of only 1 word of 4 syllables. Hence, more compact expressions on one level go along with more complex expressions on the next level. Here, the consequence of the formation of longer words is relevant. The variable K will represent this tendency.
2. There is an opposed tendency, viz. word length minimization. It is a consequence of the same tendency of effort minimization which is responsible for the first tendency, but now considered on the word level. We will denote this requirement by M.
3. The mean word length in a language can be considered as constant, at least for a certain period of time. This constant will be represented by q.
According to a general approach proposed by Altmann (cf. Altmann and Köhler (1996)) and substituting k = K − 1 and m = M − 1, the following equation can be set up:

P_x = ((k + x − 1) q / (m + x − 1)) · P_{x−1}

Its solution is the hyper-Pascal distribution

P_x = ((k)_x / (m)_x) · q^x · P_0,  x = 0, 1, 2, …  (3)

where (a)_x = a(a+1)···(a+x−1) denotes the rising factorial and P_0 = 1 / ₂F₁(k,1;m;q), the hypergeometric function ₂F₁(k,1;m;q) serving as norming constant.
Here, (3) is used in a 1-displaced form because length 0 is not defined, i.e. segments consisting of 0 words are impossible. As this model is not likely to be adequate also for F- and T-segments (the requirements concerning the basic properties frequency and polytextuality do not imply interactions between adjacent levels), a simpler one can be set up. Due to length limitations of our contribution in this volume we will not describe the appropriate model for these segment types, but it can be said here that their length distributions can be modeled and explained by the hyper-Poisson distribution.
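A small sketch of how the probabilities of the 1-displaced hyper-Pascal model (3) can be computed numerically; the truncation of the support and the purely numerical normalization (in place of the hypergeometric norming constant) are implementation conveniences assumed here, not part of the model itself.

```python
import numpy as np

def hyper_pascal_pmf(k, m, q, max_len=200):
    """Probabilities of the 1-displaced hyper-Pascal distribution for
    lengths 1, 2, ..., max_len (length 0 is excluded by the displacement).

    Uses the recurrence P_x = ((k + x - 1) * q / (m + x - 1)) * P_{x-1}
    on the non-displaced support x = 0, 1, 2, ... and shifts it by one.
    Convergence of the underlying series requires 0 < q < 1."""
    probs = np.empty(max_len)
    probs[0] = 1.0                       # unnormalized P_0
    for x in range(1, max_len):
        probs[x] = probs[x - 1] * (k + x - 1) * q / (m + x - 1)
    probs /= probs.sum()                 # numerical stand-in for the 2F1 norming constant
    return probs                         # probs[i] approximates P(length = i + 1)
```

With parameters k, m and q estimated from a text, the expected frequencies N · P(length = l) can then be compared with the observed L-segment length spectrum.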
Fig. 2. Theoretical and empirical distribution of L-segments in a poem

Fig. 3. Theoretical and empirical distribution of L-segments in a short story

The empirical tests on the data from the 66 texts support our hypothesis with good and very good χ² values. Figures 2 and 3 show typical graphs of the theoretical and empirical distributions as modeled using the hyper-Pascal distribution. Figure 2 is an example of poetry; Figure 3 shows a narrative text. Good indicators of text genre or authors could not yet be found on the basis of these distributions. However, only a few of the available characteristics have been considered so far. The same is true of the corresponding experiments with F- and T-segments.
5 TTR studies
Another hypothesis investigated in our study is the assumption that the dynamic behavior of the segments with respect to the increase of types in the course of the given text, the so-called TTR, is analogous to that of words or other linguistic units. Word TTR has the longest history; the large number of approaches presented in linguistics is described and discussed in Altmann (1988, pp. 85-90), who also gives a theoretical derivation of the so-called Herdan model, the most commonly used one in linguistics:
y = a x^b e^{cx}, c < 0. (5)
Since the first segment of a text must be the first type, y(1) = a·e^c = 1 can be assumed, i.e. a = e^{−c}. Therefore, we can remove this parameter from the model and simplify (5) as shown in (6):
y = e^{−c} x^b e^{cx} = x^b e^{c(x−1)}, c < 0. (6)

Figures 4 and 5 show the excellent fits of this model to data from one of the poems and one of the prose texts. Goodness-of-fit was determined using the determination coefficient R², which was above 0.99 in all 66 cases.

Fig. 4. L-segment TTR of a poem

Fig. 5. L-segment TTR of a short story

The parameters b and c of the TTR model turned out to be quite promising characteristics of text genre and author. They are not likely to discriminate these factors sufficiently when taken alone but seem to carry a remarkable amount of information. Figure 6 shows the relationship between the parameters b and c.
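A minimal sketch of fitting model (6) to an observed type-token curve and computing the determination coefficient R²; the use of SciPy's curve_fit and the chosen starting values are assumptions made for illustration, not the fitting procedure of the original study.

```python
import numpy as np
from scipy.optimize import curve_fit

def ttr_model(x, b, c):
    """TTR model (6): y = x**b * exp(c * (x - 1)), with c < 0."""
    return x ** b * np.exp(c * (x - 1))

def fit_ttr(types_seen):
    """types_seen[i] is the number of segment types observed after the
    first i+1 segment tokens of the text."""
    x = np.arange(1, len(types_seen) + 1, dtype=float)
    y = np.asarray(types_seen, dtype=float)
    (b, c), _ = curve_fit(ttr_model, x, y, p0=(0.9, -0.001))
    y_hat = ttr_model(x, b, c)
    r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return b, c, r2
```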
Fig. 6. Relationship between the values of b and c in the corpus
6 Conclusion
Our study has shown that L-, F- and T-segments on the word level display a lawful behavior in all aspects investigated so far, and that some of the parameters, in particular those of the TTR, seem promising for text classification. Further investigations on more languages and on more text genres will give more reliable answers to these questions.
References
ALTMANN, G. and KÖHLER, R. (1996): "Language Forces" and Synergetic Modelling of Language Phenomena. In: P. Schmidt [Ed.]: Glottometrika 15. Issues in General Linguistic Theory and The Theory of Word Length. WVT, Trier, 62-76.
Linguis-ANDERSEN, S (2005): Word length balance in texts: Proportion constancy and lengths in Proust’s longest sentence Glottometrics 11, 32-50
word-chain-BORODA, M (1982): Häufigkeitsstrukturen musikalischer Texte In: J Orlov, M Boroda, G
Moisei and I Nadarejˆsvili [Eds.]: Sprache, Text, Kunst Quantitative Analysen
KÖHLER, R. and ALTMANN, G. (2000): Probability Distributions of Syntactic Units and Properties. Journal of Quantitative Linguistics 7/3, 189-200.
KÖHLER, R. (2006b): Word length in text. A study in the syntagmatic dimension. To appear.
KÖHLER, R. (2006a): The frequency distribution of the lengths of length sequences. In: J. Genzor and M. Bucková [Eds.]: Favete linguis. Studies in honour of Victor Krupa. Slovak Academic Press, Bratislava, 145-152.
UHLÍŘOVÁ, L. (2007): Word frequency and position in sentence. To appear.
WIMMER, G. and ALTMANN, G. (1999): Thesaurus of Univariate Discrete Probability Distributions. Stamm, Essen.
Structural Differentiae of Text Types – A Quantitative Model

Olga Pustylnikov and Alexander Mehler
Abstract. The categorization of natural language texts is a well established research field in computational and quantitative linguistics (Joachims 2002). In the majority of cases, the vector space model is used in terms of a bag of words approach. That is, lexical features are extracted from input texts in order to train some categorization model and, thus, to attribute, for example, authorship or topic categories. Parallel to these approaches there has been some effort in performing text categorization not in terms of lexical features, but of structural features of documents. More specifically, quantitative text characteristics have been computed in order to derive a sort of structural text signature which nevertheless allows reliable text categorizations (Kelih & Grzybek 2005; Pieper 1975). This "bag of features" approach regains attention when it comes to categorizing websites and other document types whose structure is far away from the simplicity of tree-like structures. Here we present a novel approach to structural classifiers which systematically computes structural signatures of documents. In summary, we present a text categorization algorithm which, in the absence of any lexical features, nevertheless performs a remarkably good classification even if the classes are thematically defined.
1 Introduction
An alternative way to categorize documents apart from the well established "bag of words" approach is to categorize by means of structural features. This approach functions in the absence of any lexical information, utilizing quantitative characteristics of documents computed from the logical document structure.¹ That means that markers like content words are completely disregarded. Features like distributions of sections, paragraphs, sentence length etc. are considered instead.
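To make the idea of a structural signature concrete, the following sketch derives a simple lexicon-free feature vector from a document's logical structure; the concrete feature set, the input format and the bin boundaries are hypothetical simplifications, not the features actually used in the experiments reported here.

```python
import numpy as np

def structural_signature(doc, length_bins=(5, 10, 15, 20, 30, 50)):
    """Compute a lexicon-free feature vector from the logical structure of a
    document, given here as a list of sections, each a list of paragraphs,
    each a list of sentence lengths (in words)."""
    sentence_lengths = [l for sec in doc for par in sec for l in par]
    n_sections = len(doc)
    n_paragraphs = sum(len(sec) for sec in doc)
    n_sentences = len(sentence_lengths)
    # Relative frequencies of sentence lengths falling into predefined bins.
    hist, _ = np.histogram(sentence_lengths, bins=(0,) + length_bins + (np.inf,))
    hist = hist / max(n_sentences, 1)
    return np.concatenate((
        [n_sections,
         n_paragraphs / max(n_sections, 1),      # mean paragraphs per section
         n_sentences / max(n_paragraphs, 1),     # mean sentences per paragraph
         np.mean(sentence_lengths) if sentence_lengths else 0.0],
        hist,
    ))
```

Vectors of this kind are what enters the supervised and unsupervised scenarios described below.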
Capturing structural properties to build a classifier assumes that given category separations are reflected by structural differences. According to Biber (1995), we can expect that functional differences correlate with structural and formal representations of text types. This may explain good overall results in terms of the F-Measure.²
1 See also Mehler et al. (2006).
2 The harmonic mean of precision and recall is used here to measure the overall success of the classification.
However, the F-Measure gives no information about the quality of the investigated categories. That is, no a priori knowledge about the suitability of the categories for representing homogeneous classes and for applying them in machine learning tasks is provided. Since natural language categories, e.g. in the form of web documents or other textual units, do not necessarily come with a well defined structural representation, it is important to know how the classifier behaves when dealing with such categories. Here, we investigate a large number of existing categories, thematic classes or rubrics taken from a 10-years newspaper corpus of the Süddeutsche Zeitung (SZ 2004), where a rubric represents a recurrent part of the newspaper like 'sports'. We ask more specifically for a maximal subset of all rubrics which gives an F-Measure above a predefined cut-off c ∈ [0,1] (e.g. c = 0.9). We evaluate the classifier in a way that allows us to exclude possible drawbacks with respect to:
• the categorization model used (here SVM³ and Cluster Analysis⁴),
• the text representation model used (here the bag of features approach) and
• the structural homogeneity of the categories used.
The first point relates to distinguishing supervised and unsupervised learning. That is, we perform both sorts of learning, although we do not systematically evaluate them comparatively with respect to all possible parameters. Rather, we investigate the potential of our features by evaluating them with respect to both scenarios. The representation format (vector representation) is restricted by the model used (e.g. SVM). Thus, we concentrate on the third point and apply an iterative categorization procedure (ICP)⁵ to explore the structural suitability of categories. In summary, our experiments have twofold goals:
1. to study given categories using the ICP in order to filter out structurally inconsistent types and
2. to make judgements about the structural classifier's behavior when dealing with categories of different size and quality levels.

2 Category selection
The 10-years corpus of the SZ used in the present study contains 95 different rubrics. The frequency distribution of these rubrics shows an enormous inequality for the whole set (see Figure 1). In order to minimize the calculation effort, we reduce the initial set of 95 rubrics to a smaller subset according to the following criteria:

1. First, we compute the mean z and the standard deviation V for the whole set.
3 Support Vector Machines.
4 Supervised vs. unsupervised, respectively.
5 See Sec. 4.
Fig. 1. Categories/articles distribution of the 95 rubrics of the SZ.
2. Second, we pick out all rubrics R whose cardinality |R| (the number of examples within the corpus) lies in the interval

z − V/2 < |R| < z + V/2.

This selection method allows us to specify a window around the mean value of all documents, leaving out the unusual cases.⁶ Thus, the resulting subset of 68 categories is selected.
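A short sketch of this selection step; rubric_sizes is assumed to map each rubric name to its number of articles, and the variable names mirror z and V from above.

```python
import numpy as np

def select_rubrics(rubric_sizes):
    """Keep rubrics whose cardinality lies within a half-standard-deviation
    window around the mean, i.e. z - V/2 < |R| < z + V/2."""
    sizes = np.array(list(rubric_sizes.values()), dtype=float)
    z, V = sizes.mean(), sizes.std()
    lower, upper = z - V / 2, z + V / 2
    return [r for r, n in rubric_sizes.items() if lower < n < upper]
```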
3 The evaluation procedure
The data representation format for the subset of rubrics uses a vector representation (bag of features approach) where each document is represented by a feature vector.⁷ The vectors are calculated as structural signatures of the underlying documents. To avoid drawbacks (see Sec. 1) caused by the evaluation method in use, we compare three different categorization scenarios:
1. Supervised scenario by means of SVM-light⁸,
2. Unsupervised scenario in terms of Cluster Analysis and
3. Finally, a baseline experiment based on random clustering.
6 The method is taken from Bock (1974). Rieger (1989) uses it to identify above-average agglomeration steps in the clustering framework. Gleim et al. (2007) successfully applied the method to develop quality filters for wiki articles.
7 See Mehler et al. (2007) for a formalization of this approach.
8 Joachims (2002).
Consider an input corpus K and a set of categories C with the number of categories |C| = n. Then we proceed as follows to evaluate our various learning scenarios:

• For the supervised case we train a binary classifier by treating the negative examples of a category Ci ∈ C as K \ [Ci] and the positive examples as the subset [Ci] ⊆ K. The subsets [Ci] are in this experiment pairwise disjoint, and we define L = {[Ci] | Ci ∈ C} as a partition of positive and negative examples of Ci. Classification results are obtained in terms of precision and recall. We calculate the F-score for a class Ci in the following way:
F_i = 2 / (1/recall_i + 1/precision_i)
In the next step we compute the weighted mean over all categories of the partition L in order to judge the overall separability of the given text types in the supervised scenario (see the sketch following this list).

• For the unsupervised case, the best clustering results are presented in terms of F-Measure values.
• Finally, the random baseline is calculated by preserving the original category sizes and by mapping articles randomly to them. Results of random clustering help to check the success of both learning scenarios. Thus, clusterings close to the random baseline indicate either a failure of the cluster algorithm or that the text types cannot be well separated by structure.
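A minimal sketch of the per-class F-score and its weighted mean over the partition; weighting by class size is an assumption about how the weighted mean is formed, and the function names are illustrative.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F-score F_i from above)."""
    if precision == 0 or recall == 0:
        return 0.0
    return 2.0 / (1.0 / recall + 1.0 / precision)

def weighted_mean_f(class_results):
    """class_results: list of (class_size, precision, recall) triples,
    one per category of the partition L. Returns the size-weighted mean F-score."""
    total = sum(size for size, _, _ in class_results)
    return sum(size * f_score(p, r) for size, p, r in class_results) / total
```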
In summary, we check the performance of structural signatures within two learning scenarios, supervised and unsupervised, and compare the results with the random clustering baseline. The next section describes the iterative categorization procedure (ICP) used to investigate the structural homogeneity of categories.
4 Exploring the structural homogeneity of text types by means of the Iterative Categorisation Procedure (ICP)
In this section we return to the question mentioned at the beginning. Given a cut-off c ∈ [0,1] (e.g. c = 0.9), we ask for the maximal subset of rubrics allowing us to achieve an F-Measure value F > c. Decreasing the cut-off c successively, we get a rank ordering of rubrics ranging from the best contributors to the worst ones. The ICP allows us to determine a result set of maximal size n with the maximal internal homogeneity compared to all candidate sets in question. Starting with a given set A of input categories to be learned, we proceed as follows:
1. Start: Select a seed category C ∈ A and set A1 = {C}. The rank r of C equals r(C) = 1. Now repeat:
2. Iteration (i > 1): Let B = A \ Ai−1. Select the category C ∈ B which, when added to Ai−1, maximizes the F-Measure value among all candidate extensions of Ai−1 by single categories of B. Set Ai = Ai−1 ∪ {C} and r(C) = i.
3. Break off: The iteration terminates if either
   i) A \ Ai = ∅, or
   ii) the F-Measure value of Ai is smaller than a predefined cut-off, or
   iii) the F-Measure value of Ai is smaller than that of the operative baseline.

If none of these stop conditions holds, repeat step (2).
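A sketch of the ICP as a greedy procedure; evaluate_f is assumed to be a callback that trains and evaluates the chosen classifier on a candidate set of categories and returns its F-Measure, and baseline_f is the operative baseline value.

```python
def icp(categories, evaluate_f, seed, cutoff, baseline_f):
    """Iterative categorization procedure: greedily extend the category set,
    always adding the category that maximizes the F-Measure, and record the
    rank at which each category is added."""
    selected = [seed]
    rank = {seed: 1}
    remaining = set(categories) - {seed}
    while remaining:                                    # stop condition (i)
        # Choose the candidate whose addition maximizes the F-Measure.
        best = max(remaining, key=lambda c: evaluate_f(selected + [c]))
        f_value = evaluate_f(selected + [best])
        if f_value < cutoff or f_value < baseline_f:    # stop conditions (ii), (iii)
            break
        selected.append(best)
        rank[best] = len(selected)
        remaining.remove(best)
    return selected, rank
```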
The kind of ranking described here is more informative than the F-Measure value alone. That is, the F-Measure gives global information about the overall separability of categories. The ICP, in contrast, provides additional local information about the weights of single categories with respect to the overall performance. This information allows us to check the suitability of single categories to serve as structural prototypes. Knowledge about the homogeneity of each category provides a deeper insight into the possibilities of our approach.
In the next section, the rankings of the ICP applied to supervised and unsupervised learning and compared with the random clustering baseline are presented. In order to exclude a dependence of the structural approach on one of the learning methods, we also apply the best-of-unsupervised ranking to the supervised scenario and compare the outcomes. That means we use exactly the same set of categories that performed best in the unsupervised experiment for SVM learning.
5 Results
Table 1 gives an overview of the categories used. From the total number of 95 rubrics, 68 were selected using the selection method described in Section 2; 55 were considered in the unsupervised and 16 in the supervised experiments. The common subset used in both cases consists of 14 categories.
The Y-axis of Figure 2 represents the F-Measure values and the X-axis the rank order of categories iteratively added to the seed set. The supervised scenario (upper curve) performs best, ranging around the value of 1.0. The values of the unsupervised case decrease more rapidly (the third curve from above). The unsupervised best-of ranking categorized with the supervised method (second curve from above) lies between the best results of the two methods. The lower curve represents the results of random clustering.
6 Discussion
According to Figure 2 we can see that all F-Measure results lie far above the baseline of random clustering. All the subsets are well separated by their document