
Automatic Authorship Attribution

E. Stamatatos, N. Fakotakis and G. Kokkinakis
Dept. of Electrical and Computer Engineering
University of Patras
26500 - Patras, GREECE
stamatatos@wcl.ee.upatras.gr

Abstract

In this paper we present an approach to automatic authorship attribution dealing with real-world (or unrestricted) text. Our method is based on the computational analysis of the input text using a text-processing tool. Besides the style markers relevant to the output of this tool, we also use analysis-dependent style markers, that is, measures that represent the way in which the text has been processed. No word frequency counts, nor other lexically-based measures, are taken into account. We show that the proposed set of style markers is able to distinguish texts of various authors of a weekly newspaper using multiple regression. All the experiments we present were performed using real-world text downloaded from the World Wide Web. Our approach is easily trainable and fully automated, requiring no manual text preprocessing nor sampling.

1 Introduction

The vast majority of the attempts at computer-assisted authorship attribution have focused on literary texts. In particular, a lot of attention has been paid to the establishment of the authorship of anonymous or doubtful texts. A typical paradigm is the case of the Federalist papers, twelve of which are of disputed authorship (Mosteller and Wallace, 1984; Holmes and Forsyth, 1995). Moreover, the lack of a generic and formal definition of the idiosyncratic style of an author has led to the employment of statistical methods (e.g., discriminant analysis, principal components, etc.). Nowadays, the wealth of text available on the World Wide Web in electronic form for a wide variety of genres and languages, as well as the development of reliable text-processing tools, opens the way for the solution of the authorship attribution problem as regards real-world text.

The most important approaches to authorship attribution involve lexically based measures. A lot of style markers have been proposed for measuring the richness of the vocabulary used by the author: for example, the type-token ratio, the hapax legomena (i.e., once-occurring words), the hapax dislegomena (i.e., twice-occurring words), etc. There are also functions that make use of these measures, such as Yule's K (Yule, 1944) and Honore's R (Honore, 1979). A review of these metrics can be found in (Holmes, 1994). In (Holmes and Forsyth, 1995) five vocabulary richness functions were used in the framework of a multivariate statistical analysis of the Federalist papers, and a principal components analysis was performed. All the disputed papers lie on the side of James Madison (rather than Alexander Hamilton) in the space of the first two principal components. However, such measures require the development of large lexicons with specialized information in order to detect the various forms of the lexical units that constitute an author's vocabulary. For languages with a rich morphology, such as Modern Greek, this is an important shortcoming.
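The vocabulary-richness measures listed above are straightforward to compute from a token list. The following sketch is our own illustration, not the paper's code (the paper deliberately avoids such lexical measures); it assumes tokenization has already been done and uses the common formulation of Yule's K, namely K = 10^4 (Σ_m m² V_m − N) / N², where N is the token count and V_m is the number of word types occurring exactly m times:

```python
from collections import Counter

def vocabulary_richness(tokens):
    """Compute simple vocabulary-richness measures from a token list."""
    n = len(tokens)                      # N: total tokens
    freqs = Counter(tokens)             # word type -> frequency
    v = len(freqs)                      # V: vocabulary size (types)
    spectrum = Counter(freqs.values())  # V_m: number of types occurring m times
    # Yule's K: 10^4 * (sum_m m^2 * V_m - N) / N^2
    k = 10_000 * (sum(m * m * vm for m, vm in spectrum.items()) - n) / (n * n)
    return {
        "type_token_ratio": v / n,
        "hapax_legomena": spectrum.get(1, 0),     # once-occurring words
        "hapax_dislegomena": spectrum.get(2, 0),  # twice-occurring words
        "yules_k": k,
    }

tokens = "the cat sat on the mat and the dog sat too".split()
print(vocabulary_richness(tokens))
```

As the paper notes, all of these depend on recognizing the surface forms of lexical units, which is exactly what becomes hard in a morphologically rich language.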

Instead of counting how many words occur a certain number of times, Burrows (1987) proposed the use of a set of common function (or context-free) word frequencies in the sample text. This method, combined with a principal components analysis, achieved remarkable results when applied to a wide variety of authors (Burrows, 1992). On the other hand, a lot of effort is required regarding the selection of the most appropriate set of words that best distinguish a given set of authors (Holmes and Forsyth, 1995). Moreover, all the lexically-based style markers are highly author- and language-dependent. The results of a work using such measures, therefore, can be applied neither to a different group of authors nor to another language.
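A Burrows-style feature vector of function-word frequencies can be sketched as follows. This is our illustration only: the word list below is a tiny hypothetical English sample, whereas an actual study would select dozens of function words for the specific language and author group, which is precisely the selection effort criticized above.

```python
from collections import Counter

# Hypothetical mini word list; real Burrows-style studies select
# dozens of common function words per language and author group.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "it"]

def function_word_profile(tokens):
    """Relative frequency of each function word in the sample text."""
    n = len(tokens)
    counts = Counter(tokens)  # missing words count as zero
    return [counts[w] / n for w in FUNCTION_WORDS]

profile = function_word_profile("the cat and the dog ran to the mat".split())
print(profile)
```

Such profiles are then typically fed into a principal components analysis to separate authors.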

In order to avoid the problems of lexically-based measures, Baayen et al. (1996) proposed the use of syntax-based ones. This approach is based on the frequencies of the rewrite rules as they appear in a syntactically annotated corpus. Both high-frequency and low-frequency rewrite rules give accuracy results comparable to lexically-based methods. However, the computational analysis is considered a significant limitation of this method, since the required syntactic annotation scheme is very complicated and current text-processing tools are not capable of providing such information automatically, especially in the case of unrestricted text.
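For illustration, rewrite-rule frequencies of the kind Baayen et al. count can be read off parse trees once a syntactic annotation exists. The sketch below is our own (trees are represented as nested tuples rather than any real treebank format) and simply collects each label → children expansion:

```python
from collections import Counter

def rewrite_rules(tree):
    """Collect (label, right-hand side) rewrite rules from a parse tree.
    A tree is a tuple (label, child, ...); leaves are plain strings."""
    rules = []
    if isinstance(tree, tuple):
        label, *children = tree
        # The right-hand side is the sequence of child labels / leaf words.
        rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
        rules.append((label, rhs))
        for child in children:
            rules.extend(rewrite_rules(child))
    return rules

# (S (NP the cat) (VP sat))
tree = ("S", ("NP", "the", "cat"), ("VP", "sat"))
print(Counter(rewrite_rules(tree)))
```

The hard part, as the paragraph above stresses, is not this counting step but producing the annotated trees automatically for unrestricted text.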

To the best of our knowledge, there is no computational system for the automatic detection of authorship dealing with real-world text. In this paper, we present an approach to this problem. In particular, our aim is the discrimination between the texts of various authors of a Modern Greek weekly newspaper. We use an already existing text-processing tool able to detect sentence and chunk boundaries in unrestricted text for the extraction of style markers. Instead of trying to minimize the computational analysis of the text, we attempt to take advantage of this procedure. In particular, we use a set of analysis-level style markers, i.e., measures that represent the way in which the text has been processed by the tool. For example, a useful measure is the percentage of the sample text remaining unanalyzed after the automatic processing. In other words, we attempt to adapt the set of style markers to the method used by the sentence and chunk detector in order to analyze the sample text. The statistical technique of multiple regression is then used for extracting a linear combination of the values of the style markers that manages to distinguish the different authors. The experiments we present, for both author identification and author verification tasks, were performed using real-world text downloaded from the World Wide Web. Our approach is easily trainable and fully automated, requiring no manual text preprocessing nor sampling.

A brief description of the extraction of the style markers is given in section 2. Section 3 describes the composition of the corpus of real-world text used in the experiments. The training procedure is given in section 4, while section 5 comprises analytical experimental results. Finally, in section 6 some conclusions are drawn and future work directions are given.

2 Extraction of Style Markers

As aforementioned, an already existing tool is used for the extraction of the style markers. This tool is a Sentence and Chunk Boundaries Detector (SCBD) able to deal with unrestricted Modern Greek text (Stamatatos et al., forthcoming). Initially, SCBD segments the input text into sentences using a set of disambiguation rules, and then detects the boundaries of intrasentential phrases (i.e., chunks) such as noun phrases, prepositional phrases, etc. It has to be noted that SCBD makes use of no complicated resources (e.g., large lexicons). Rather, it is based on common word suffixes and a set of keywords in order to detect the chunk boundaries using empirically derived rules. A sample of its output is a Greek sentence segmented into labeled chunks of the form VP[...], NP[...], PP[...], CON[...] (the Greek sample text is not reproduced here).

Based on the output of this tool, the following measures are provided:

- token-level measures: punctuation mark count, etc.
- phrase-level measures: noun phrase count, words included in noun phrases count, prepositional phrase count, words included in prepositional phrases count, etc.

In addition, we use measures relevant to the computational analysis of the input text:

Trang 3

Table 1. The Corpus, Consisting of Texts Taken from the Weekly Newspaper TO BHMA (columns: author code A01-A10, author, number of texts, total words, thematic area; only one row survives legibly in this copy: K. Tsoukalas, 20 texts, 30,316 words, International affairs)

unanalyzed word count after each parsing pass, keyword count, non-matching word count, and assigned morphological descriptions for both words and chunks.

The latter measures can be calculated only when this particular computational tool is utilized. In more detail, SCBD performs multiple-pass parsing (i.e., 5 passes). Each parsing pass analyzes a part of the sentence, based on the results of the previous passes, and the remaining part is kept for the subsequent passes. The first passes try to detect the simplest cases of the chunk boundaries, which are easily recognizable, while the last ones deal with more complicated cases using the findings of the previous passes. The percentage of the words remaining unanalyzed after each parsing pass, therefore, is an important stylistic factor that represents the syntactic complexity of the text. Additionally, the measures of the detected keywords and of the detected words that do not match any of the stored suffixes carry crucial stylistic information.

The vast majority of natural language processing tools can provide analysis-level style markers. However, the manner of capturing the stylistic information may differ, since it depends on the method of analysis.

In order to normalize the calculated style markers, we make use of ratios of them (e.g., words / sentences, noun phrases / total detected chunks, words remaining unanalyzed after parsing pass 1 / words, etc.). The total set of style markers comprises 22 markers, namely: 3 token-level, 10 phrase-level, and 9 analysis-level ones.
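The normalization step above can be sketched as follows. This is our illustration, not the paper's code: the dictionary keys are hypothetical stand-ins for SCBD's actual raw counts, and only four of the 22 markers are shown.

```python
def style_marker_vector(counts):
    """Normalize raw counts from a (hypothetical) sentence/chunk
    detector into ratio-style markers, as the paper describes."""
    words = counts["words"]
    chunks = counts["detected_chunks"]
    return {
        "words_per_sentence": words / counts["sentences"],
        "noun_phrase_ratio": counts["noun_phrases"] / chunks,
        "prep_phrase_ratio": counts["prep_phrases"] / chunks,
        "unanalyzed_pass1_ratio": counts["unanalyzed_after_pass1"] / words,
    }

# Made-up counts for a single sample text.
raw = {"words": 1200, "sentences": 48, "detected_chunks": 300,
       "noun_phrases": 150, "prep_phrases": 90, "unanalyzed_after_pass1": 240}
print(style_marker_vector(raw))
```

Ratios of this kind make the markers comparable across texts of different lengths, which is why the paper needs no text sampling.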

3 Corpus

The corpus used for this study consists of texts downloaded from the World Wide Web site of the Modern Greek weekly newspaper TO BHMA (Dolnet, 1998). This newspaper comprises several supplements. We chose to deal with authors of the supplement B, entitled NEEZ EHOXEZ (i.e., new ages), which comprises essays on science, culture, history, etc., since in such writings the idiosyncratic style of the author is not likely to be overshadowed by the characteristics of the corresponding text genre. In general, the texts included in the supplement B are written by scholars, writers, etc., rather than journalists. Moreover, there is a closed set of authors that regularly publish their writings in the pages of this supplement. The collection of a considerable amount of texts by an author was, therefore, possible.

Initially, we selected 10 authors whose writings are frequently published in this supplement. No special criteria were taken into account. Then, 20 texts of each author were downloaded from the Web site of the newspaper. No manual text preprocessing nor text sampling was performed, aside from removing unnecessary headings. All the downloaded texts were taken from issues published during 1998 in order to minimize the potential change of the personal style of an author over time. Some statistics of the downloaded corpus are shown in table 1. The last column of this table refers to the thematic area of the majority of the writings of each author. Notice that this information was not taken into account during the construction of the corpus.

4 Training

The corpus described in the previous section was divided into a training and a test corpus. As shown by Biber (1990; 1993), it is possible to represent the distributions of many core linguistic features of a stylistic category based on relatively few texts from each category (i.e., as few as ten texts). Thus, for each author 10 texts were used for training and 10 for testing. All the texts were analyzed using SCBD, which provided a vector of 22 style markers for each text. Then, the statistical methodology of multivariate linear multiple regression was applied to the training corpus. Multiple regression provides predicted values of a group of response (dependent) variables from a collection of predictor (independent) variable values. The response is expressed as a linear combination of the predictor variables, namely:

y_i = b_0i + z_1 b_1i + z_2 b_2i + ... + z_r b_ri + e_i

where y_i is the response for the i-th author, z_1, z_2, ..., z_r are the predictor variables (i.e., in our case r = 22), b_0i, b_1i, b_2i, ..., b_ri are the unknown coefficients, and e_i is the random error. During the training procedure the unknown coefficients for each author are determined using binary values for the response variable (i.e., 1 for the texts written by the author in question, 0 for the others). Thus, the greater the response value of a certain author, the more likely that author is to be the author of the text.

Some statistics measuring the degree to which the regression functions fit the training data are presented in table 2. Notice that R² is the coefficient of determination, defined as follows:

R² = Σ_{j=1..n} (ŷ_j − ȳ)² / Σ_{j=1..n} (y_j − ȳ)²

where n is the total number of training data (texts), ȳ is the mean response, and ŷ_j and y_j are the estimated response and the training response value of the j-th author, respectively.

Additionally, a significant F-value implies that a statistically significant proportion of the total variation in the dependent variable is explained.

Table 2. Statistics of the Regression Functions

Code | R²   | F-value
A01  | 0.40 | 2.32
A02  | 0.72 | 9.12
A03  | 0.44 | 2.80
A04  | 0.44 | 2.80
A05  | 0.32 | 1.61
A06  | 0.51 | 3.57
A07  | 0.59 | 5.13
A08  | 0.35 | 1.87
A09  | 0.53 | 4.00
A10  | 0.63 | 5.90

It has to be noted that we use this particular discrimination method due to the facility offered in the computation of the unknown coefficients, as well as the computationally simple calculation of the predictor values. However, we believe that any other methodology for discrimination-classification could be applied (e.g., discriminant analysis, neural networks, etc.).
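The training scheme described above amounts to one least-squares fit per author with binary targets, followed by classification to the author with the greatest response. The following sketch is our own minimal illustration on synthetic two-marker data, not the authors' code; the style-marker values and author labels are made up.

```python
import numpy as np

def train_author_models(X, authors):
    """Fit one linear regression per author with binary responses:
    1 for that author's training texts, 0 for everyone else's.
    X: (n_texts, n_markers) style-marker matrix; authors: label list."""
    Z = np.hstack([np.ones((X.shape[0], 1)), X])  # add intercept column
    models = {}
    for a in sorted(set(authors)):
        y = np.array([1.0 if lab == a else 0.0 for lab in authors])
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        models[a] = coef
    return models

def identify(models, x):
    """Classify a text to the author with the greatest response value."""
    z = np.concatenate([[1.0], x])
    return max(models, key=lambda a: models[a] @ z)

# Synthetic data: 2 style markers, 2 authors, 2 training texts each.
X = np.array([[25.0, 0.20], [24.0, 0.25], [15.0, 0.50], [14.0, 0.45]])
labels = ["A01", "A01", "A02", "A02"]
models = train_author_models(X, labels)
print(identify(models, np.array([24.5, 0.22])))  # → A01
```

This mirrors the paper's setup with r = 2 instead of 22 markers, and with far fewer texts than the ten per author used for training.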

5 Performance

Before proceeding to the presentation of the analytical results of our disambiguation method, a representation of the test corpus in a two-dimensional space illustrates the main differences and similarities between the authors. Towards this end, we performed a principal components analysis; the representation of the 100 texts of the test corpus in the space defined by the first and second principal components (i.e., accounting for 43% of the total variation) is depicted in figure 1. As can be seen, the majority of the texts written by the same author tend to cluster. Nevertheless, these clusters cannot be clearly separated.
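A projection of this kind can be computed via a singular value decomposition of the centered style-marker matrix. The sketch below is our illustration (the input is random stand-in data, not the paper's 100 test texts) and also reports the fraction of total variation accounted for by the retained components:

```python
import numpy as np

def pca_project(X, k=2):
    """Project style-marker vectors onto the first k principal components
    and report the fraction of total variation they account for."""
    Xc = X - X.mean(axis=0)                 # center each marker
    # SVD of the centered matrix: rows of Vt are the principal directions.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T                  # coordinates in PC space
    explained = (S[:k] ** 2).sum() / (S ** 2).sum()
    return scores, explained

# Synthetic stand-in for 100 texts x 22 style markers.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 22))
scores, explained = pca_project(X)
print(scores.shape, float(explained))
```

Plotting the two columns of `scores`, with one symbol per author, would reproduce the kind of scatter shown in figure 1.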

According to our approach, the criterion for identifying the author of a text is the value of the response linear function. Hence, a text is classified to the author whose response value is the greatest. The confusion matrix derived from the application of the disambiguation procedure to the test corpus is presented in table 3, where each row contains the responses for the ten test texts of the corresponding author. The last column refers to the identification error (i.e.,

Figure 1. The Test Corpus in the Space of the First Two Principal Components (scatter plot with one symbol per author A01-A10; not reproduced here)

Table 3. Confusion Matrix of the Author Identification Experiment (the response entries are not legible in this copy; the last column, the identification error per author, survives: A01: 0.7, A02: 0.0, A03: 0.2, A04: 0.1, A05: 0.7, A06: 0.3, A07: 0.0, A08: 0.6, A09: 0.1, A10: 0.4)

erroneously classified texts / total texts) for each author. Approximately 65% of the average identification error corresponds to three authors, namely: A01, A05, and A08. Notice that these are the authors with an average text size smaller than 1,000 words (see table 1). It appears, therefore, that a text sample of relatively short size (i.e., less than 1,000 words) is not adequate for representing the characteristics of an author's style. Notice that similar conclusions are drawn by Biber (1990; 1993).

Instead of trying to identify who the author of a text is, some applications require the verification of the hypothesis that a given person is the author of the text. In such a case, only the response function of the author in question is involved. Towards this end, a threshold value has to be set: if the response value for the given author is greater than the threshold, then the author is accepted.

Figure 2. FR, FA, and Mean Error as Functions of Subdivisions of R (curves not reproduced here; x-axis: threshold from 0 to 1 in steps of 0.1)

To measure the verification accuracy for a certain author, we defined False Rejection (FR) and False Acceptance (FA) as follows:

FR = rejected texts of the author / total texts of the author

FA = accepted texts of other authors / total texts of other authors

Similar measures are widely utilized in the area of speaker recognition in speech processing (Fakotakis et al., 1993).
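Given per-text response values for one author's regression function, FR and FA are simple counting exercises. The sketch below is our illustration with made-up responses and a fixed threshold:

```python
def verification_errors(responses, is_author, threshold):
    """False rejection / false acceptance for one author's response
    function. responses: response value per test text; is_author: True
    if the text really belongs to the author in question."""
    pairs = list(zip(responses, is_author))
    rejected_own = sum(1 for r, own in pairs if own and r <= threshold)
    accepted_other = sum(1 for r, own in pairs if not own and r > threshold)
    n_own = sum(is_author)
    n_other = len(is_author) - n_own
    fr = rejected_own / n_own
    fa = accepted_other / n_other
    return fr, fa, (fr + fa) / 2  # mean error, as in the paper

resp = [0.9, 0.6, 0.2, 0.4, 0.1, 0.3]
own = [True, True, True, False, False, False]
print(verification_errors(resp, own, threshold=0.35))
```

Sweeping `threshold` over subdivisions of R and plotting the three returned values reproduces the kind of trade-off curves shown in figure 2.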

The multiple correlation coefficient R = +√R² of a regression function (see table 2) equals 1 if the fitted equation passes through all the data points; at the other extreme, it equals 0. The fluctuation of average FR, FA, and mean error (i.e., (FR+FA)/2) for the entire test corpus, using subdivisions of R as the threshold (x-axis), is shown in figure 2; the minimum mean error corresponds to R/2. Notice that by choosing the threshold based on the minimal mean error the majority of applications is covered. On the other hand, some applications require either minimal FR or minimal FA, and this fact has to be taken into account during the selection of the threshold.

The results of the author verification experiment using R/2 as the threshold are presented in table 4. Approximately 70% of the total false rejection corresponds to the authors A01, A05, and A08, as in the case of author identification. On the other hand, false acceptance seems to be highly relevant to the threshold value: the smaller the threshold value, the greater the false acceptance. Thus, the authors A03, A04, A05, and A08 are responsible for 72% of the total false acceptance error.

Table 4. Author Verification Results (threshold = R/2)

Code    | R/2  | FR   | FA
Average | 0.35 | 0.22 | 0.068

(per-author rows are not legible in this copy)

Finally, the total time cost (i.e., text processing by SCBD, calculation of style markers, computation of response values) for the entire test corpus was 58.64 seconds, or 1,971 words per second, using a Pentium at 350 MHz.

6 Conclusions

We presented an approach to automatic authorship attribution of real-world texts. A computational tool was used for the automatic extraction of the style markers. In contrast to other proposed systems, we took advantage of this procedure in order to extract analysis-level style markers that represent the way in which the text has been analyzed. The experiments, based on texts taken from a weekly Modern Greek newspaper, show that the stylistic differences among a wide range of authors can be easily detected using the proposed set of style markers. Both the author identification and the author verification tasks have given encouraging results.

Moreover, no lexically-based measures, such as word frequencies, are involved. This approach can be applied to a wide variety of authors and types of texts, since no domain-dependent, genre-dependent, or author-dependent style markers have been taken into account. Although our method has been tested on Modern Greek, it requires no language-specific information. The only prerequisites for employing this method in another language are the availability of a general-purpose text-processing tool and the appropriate selection of the analysis-level measures.

The presented approach is fully automated, since it is not based on specialized text preprocessing requiring manual effort. Nevertheless, we believe that the accuracy results may be significantly improved by employing text-sampling procedures for selecting the parts of text that best illustrate the stylistic features of an author.

Regarding the amount of required training data, we showed that ten texts are adequate for representing the stylistic features of an author. Some experiments we performed using more than ten texts as training corpus for each author did not significantly improve the accuracy results. It has also been shown that a lower bound on the text size is 1,000 words. Nevertheless, we believe that this limitation mainly affects authors with vague stylistic characteristics.

We are currently working on the application of the presented methodology to text-genre detection, as well as to any stylistically homogeneous group of real-world texts. We also aim to explore the use of a variety of computational tools for the extraction of analysis-level style markers for Modern Greek and other natural languages.

References

Baayen, H., H. Van Halteren, and F. Tweedie. 1996. Outside the Cave of Shadows: Using Syntactic Annotation to Enhance Authorship Attribution. Literary and Linguistic Computing, 11(3): 121-131.

Biber, D. 1990. Methodological Issues Regarding Corpus-based Analyses of Linguistic Variations. Literary and Linguistic Computing, 5: 257-269.

Biber, D. 1993. Representativeness in Corpus Design. Literary and Linguistic Computing, 8: 1-15.

Burrows, J. 1987. Word-patterns and Story-shapes: The Statistical Analysis of Narrative Style. Literary and Linguistic Computing, 2(2): 61-70.

Burrows, J. 1992. Not Unless You Ask Nicely: The Interpretative Nexus Between Analysis and Information. Literary and Linguistic Computing, 7(2): 91-109.

Dolnet. 1998. TO BHMA. Lambrakis Publishing Corporation. http://tovima.dolnet.gr/

Fakotakis, N., A. Tsopanoglou, and G. Kokkinakis. 1993. A Text-independent Speaker Recognition System Based on Vowel Spotting. Speech Communication, 12: 57-68.

Holmes, D. 1994. Authorship Attribution. Computers and the Humanities, 28: 87-106.

Holmes, D. and R. Forsyth. 1995. The Federalist Revisited: New Directions in Authorship Attribution. Literary and Linguistic Computing, 10(2): 111-127.

Honore, A. 1979. Some Simple Measures of Richness of Vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2): 172-177.

Mosteller, F. and D. Wallace. 1984. Applied Bayesian and Classical Inference: The Case of the Federalist Papers. Addison-Wesley, Reading, MA.

Stamatatos, E., N. Fakotakis, and G. Kokkinakis. Forthcoming. On Detecting Sentence and Chunk Boundaries in Unrestricted Text Based on Minimal Resources.

Yule, G. 1944. The Statistical Study of Literary Vocabulary. Cambridge University Press, Cambridge.
