Data Analysis Machine Learning and Applications Episode 3 Part 7 ppt

A New Interval Data Distance Based on the Wasserstein Metric

In practice, they consider the expected value of the distance between all the points belonging to interval A and all the points belonging to interval B. In their paper, they state that it is a distance, but it is easy to observe that it does not satisfy the first of the properties mentioned above. Indeed, the distance of an interval from itself is equal to zero only if the interval is thin (a = b):

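\[
d_{TD}(A,A) = \left(\frac{a+b}{2} - \frac{a+b}{2}\right)^{2} + \frac{1}{3}\left[\left(\frac{b-a}{2}\right)^{2} + \left(\frac{b-a}{2}\right)^{2}\right] = \frac{2}{3}\left(\frac{b-a}{2}\right)^{2} \ge 0 \qquad (2)
\]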

Hausdorff-based distances. The most common distance used for the comparison of two sets is the Hausdorff distance (see footnote 2). Considering two sets A and B of points of R^n, and a distance d(x, y) where x ∈ A and y ∈ B, the Hausdorff distance is defined as follows:

\[
d_H(A,B) = \max\left( \sup_{x \in A} \inf_{y \in B} d(x,y),\ \sup_{y \in B} \inf_{x \in A} d(x,y) \right)
\]

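If d(x, y) is the L1 (city block) distance, then Chavent et al. (2002) proved that, for two intervals A = [a, b] and B = [u, v],

\[
d_H(A,B) = \max(|a-u|, |b-v|) = \left|\frac{a+b}{2} - \frac{u+v}{2}\right| + \left|\frac{b-a}{2} - \frac{v-u}{2}\right|
\]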

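L_q distances between the bounds of intervals. A family of distances between intervals has been proposed by De Carvalho et al. (2006). Considering a set of interval data described in a space R^p, the metric of norm q is defined as:

\[
d_{L_q}(A,B) = \left[ \sum_{j=1}^{p} \left( |a_j - u_j|^{q} + |b_j - v_j|^{q} \right) \right]^{1/q}
\]

where [a_j, b_j] and [u_j, v_j] are the intervals of A and B on the j-th variable.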

They also showed that if the norm is L∞, then d_{L∞} = d_H (in the L1 norm).
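The two bound-based distances above can be compared on a small example. The following is a minimal sketch, not the authors' code: the function names are hypothetical, intervals are passed as (lower, upper) pairs, and the midpoint-radius form of the Hausdorff distance is included only as a numerical cross-check.

```python
def hausdorff_l1(a, b, u, v):
    """Hausdorff distance between [a, b] and [u, v] with a city-block base distance."""
    return max(abs(a - u), abs(b - v))


def hausdorff_mid_rad(a, b, u, v):
    """The same quantity written with midpoints and radii."""
    m_a, r_a = (a + b) / 2, (b - a) / 2
    m_b, r_b = (u + v) / 2, (v - u) / 2
    return abs(m_a - m_b) + abs(r_a - r_b)


def lq_distance(x, y, q):
    """L_q distance between two vectors of intervals.

    x and y are lists of (lower, upper) pairs, one pair per variable.
    """
    total = sum(abs(a - u) ** q + abs(b - v) ** q
                for (a, b), (u, v) in zip(x, y))
    return total ** (1.0 / q)


A, B = (1.0, 4.0), (2.0, 7.0)
print(hausdorff_l1(*A, *B))        # bound form
print(hausdorff_mid_rad(*A, *B))   # midpoint-radius form, same value
print(lq_distance([A], [B], 2))    # L2 distance between the bounds, p = 1
```

For a single variable, raising q makes the L_q value approach the larger of the two bound differences, which is the behaviour the L∞ remark refers to.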

The same measure was extended (De Carvalho (2007)) to an adaptive one in order to take into account the variability of the different clusters in a dynamical clustering process.

3 Our proposal: Wasserstein distance

If we suppose a uniform distribution of points, an interval of reals A (t) = [a, b] can

be expressed as the following type of function:
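A form consistent with this description, assuming that t runs over [0, 1] and parametrises the interval from its lower to its upper bound (that is, the quantile function of a uniform distribution on [a, b]), would be:

```latex
A(t) = a + t\,(b - a), \qquad 0 \le t \le 1 .
```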

Footnote 2: The name is related to Felix Hausdorff, who is well known for the separability theorem on topological spaces at the end of the 19th century.


708 Rosanna Verde and Antonio Irpino

If we consider a description of the interval by means of its midpoint m and radius r,

the same function can be rewritten as follows:
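With m = (a + b)/2 and r = (b − a)/2, and under the same assumption that t runs over [0, 1], a consistent rewriting would be:

```latex
A(t) = m + r\,(2t - 1), \qquad 0 \le t \le 1 .
```

At t = 0 this gives m − r = a, and at t = 1 it gives m + r = b.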

Then, the squared Euclidean distance between homologous points of two intervals A = [a, b] and B = [u, v], or described in midpoint-radius notation as A = (m_A, r_A) and B = (m_B, r_B), is defined as follows:
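As a worked sketch, assuming the parametrisation A(t) = m_A + r_A(2t − 1) and B(t) = m_B + r_B(2t − 1) on [0, 1], integrating the squared difference of homologous points gives:

```latex
\begin{aligned}
d_W^2(A,B) &= \int_0^1 \bigl[A(t) - B(t)\bigr]^2 \, dt \\
           &= \int_0^1 \bigl[(m_A - m_B) + (r_A - r_B)(2t - 1)\bigr]^2 \, dt \\
           &= (m_A - m_B)^2 + \tfrac{1}{3}\,(r_A - r_B)^2 ,
\end{aligned}
```

because the integral of (2t − 1) over [0, 1] is 0 and the integral of (2t − 1)^2 is 1/3.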

[…] uniform density functions U(a, b) and U(u, v). In this way, we may use the Monge-Kantorovich-Wasserstein-Gini metric (Gibbs and Su (2002)). Let F be a distribution function; F^{-1} is the corresponding quantile function. Given two univariate random variables X_A and X_B, the Wasserstein-Kantorovich distance is defined as:
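For orientation, the standard L2 form of the Wasserstein distance between two univariate distributions is written with their quantile functions. The notation F_A, F_B for the two distribution functions and X_A, X_B for the random variables is an assumption made here for readability:

```latex
d_W^2(X_A, X_B) = \int_0^1 \bigl( F_A^{-1}(t) - F_B^{-1}(t) \bigr)^2 \, dt .
```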

Denoting μ_A = E(A) (resp. μ_B = E(B)), σ_A² = VAR(A) (resp. σ_B² = VAR(B)) and Corr_QQ as the correlation of the quantiles of F_A and F_B, Irpino and Romano (2007) proved that (10) can be decomposed as:

\[
d_W^2(X_A, X_B) = (\mu_A - \mu_B)^2 + (\sigma_A - \sigma_B)^2 + 2\,\sigma_A \sigma_B \left[ 1 - \mathrm{Corr}_{QQ}(F_A, F_B) \right] \qquad (12)
\]

The proposed decomposition allows the effects on the distance of the different location, different size and different shape of the two densities to be considered separately.
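The decomposition can be checked numerically for two uniform distributions. This is a sketch under stated assumptions: the helper names are hypothetical, the integral over quantile functions is approximated with a midpoint rule, and the quantile correlation is set to 1 because both quantile functions are linear in t.

```python
import math


def quantile_uniform(lo, hi):
    """Return the quantile function t -> lo + t * (hi - lo) of U(lo, hi)."""
    return lambda t: lo + t * (hi - lo)


def wasserstein_sq_numeric(q_a, q_b, n=20_000):
    """Midpoint-rule approximation of the integral of (q_a(t) - q_b(t))^2 on [0, 1]."""
    h = 1.0 / n
    return sum((q_a((i + 0.5) * h) - q_b((i + 0.5) * h)) ** 2 for i in range(n)) * h


def decomposition_uniform(a, b, u, v):
    """Location, size and shape terms for U(a, b) versus U(u, v)."""
    mu_a, mu_b = (a + b) / 2, (u + v) / 2
    sd_a, sd_b = (b - a) / math.sqrt(12), (v - u) / math.sqrt(12)
    corr_qq = 1.0  # assumption: both intervals are non-degenerate
    location = (mu_a - mu_b) ** 2
    size = (sd_a - sd_b) ** 2
    shape = 2 * sd_a * sd_b * (1 - corr_qq)
    return location, size, shape


numeric = wasserstein_sq_numeric(quantile_uniform(1, 4), quantile_uniform(2, 7))
parts = decomposition_uniform(1, 4, 2, 7)
print(numeric, sum(parts), parts)
```

For two intervals the shape term vanishes, so only location and size contribute; with other density shapes the third term would be non-zero.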


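In order to calculate the distance between two elements described by p interval variables, we propose the following extension of the distance to the multivariate case in the sense of Minkowski:

\[
d_W(A,B) = \sqrt{ \sum_{j=1}^{p} d_W^2(A_j, B_j) }
\]

where A_j and B_j denote the intervals of A and B on the j-th variable.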
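A minimal sketch of such a multivariate extension follows. It assumes uniform intervals, so that the per-variable squared distance is (m_A − m_B)^2 + (r_A − r_B)^2 / 3, and it assumes that the variables are combined as a root of the sum of squares; both function names are hypothetical.

```python
import math


def wasserstein_sq_interval(a, b, u, v):
    """Squared L2 Wasserstein distance between U(a, b) and U(u, v)."""
    m_a, r_a = (a + b) / 2, (b - a) / 2
    m_b, r_b = (u + v) / 2, (v - u) / 2
    return (m_a - m_b) ** 2 + (r_a - r_b) ** 2 / 3


def wasserstein_multivariate(x, y):
    """Combine the per-variable squared distances of two interval vectors.

    x and y are lists of (lower, upper) pairs, one pair per variable.
    """
    return math.sqrt(sum(wasserstein_sq_interval(a, b, u, v)
                         for (a, b), (u, v) in zip(x, y)))


x = [(1.0, 4.0), (0.0, 2.0)]
y = [(2.0, 7.0), (1.0, 3.0)]
print(wasserstein_multivariate(x, y))
```

Summing per-variable terms in this way treats the descriptors as independent, which is the hypothesis the chapter states for its extension.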

[…] the clusters by means of prototypes (Chavent et al. (2006)). In the literature, several authors indicate how to compute prototypes. In particular, Verde and Lauro (2000) proposed that the prototype of a cluster must be considered as an element having the same properties as the clustered elements. In this way, a cluster of intervals is described by a single prototypal interval, in the same way as a cluster of points is represented by its barycenter.

Let E be a set of n data described by p interval variables X_j (j = 1, ..., p). The general DCA looks for the partition P ∈ P_k of E in k classes, among all the possible partitions P_k, and the vector L ∈ L_k of k prototypes representing the classes in P, such that the following fitting criterion Δ between L and P is minimized:

Δ(P*, L*) = Min{Δ(P, L) | P ∈ P_k, L ∈ L_k}. (14)

Such a criterion is defined as the sum of dissimilarity or distance measures δ(x_i, G_h) of fitting between each object x_i belonging to a class C_h ∈ P and the class representation G_h ∈ L:
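Writing Δ for the criterion and δ for the dissimilarity between an object and a prototype, a sum of this kind over classes C_1, ..., C_k with prototypes G_1, ..., G_k takes the following form (a sketch of the usual additive criterion, with the index ranges assumed):

```latex
\Delta(P, L) = \sum_{h=1}^{k} \; \sum_{x_i \in C_h} \delta(x_i, G_h) .
```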

A prototype G_h associated to a class C_h is an element of the space of the description of E, and it can be represented as a vector of intervals. The algorithm is initialized by generating k random clusters or, alternatively, k random prototypes. Generally, the criterion Δ(P, L) is based on an additive distance on the p descriptors.
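The alternation between allocation and prototype update can be sketched as follows. This is an illustration under explicit assumptions rather than the authors' implementation: intervals are stored as (midpoint, radius) pairs, the dissimilarity is the squared distance (m_x − m_g)^2 + (r_x − r_g)^2 / 3 summed over variables, the prototype is the mean midpoint and mean radius per variable (which minimises that sum within a class), and all names are hypothetical.

```python
import random


def dist_sq(x, g):
    """Squared distance between two vectors of (midpoint, radius) pairs."""
    return sum((mx - mg) ** 2 + (rx - rg) ** 2 / 3
               for (mx, rx), (mg, rg) in zip(x, g))


def prototype(cluster):
    """Per-variable mean midpoint and mean radius of the cluster members."""
    n, p = len(cluster), len(cluster[0])
    return [(sum(x[j][0] for x in cluster) / n,
             sum(x[j][1] for x in cluster) / n) for j in range(p)]


def dynamic_clustering(data, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    prototypes = rng.sample(data, k)      # initialise with k random prototypes
    labels = None
    for _ in range(max_iter):
        # Allocation step: assign each object to its closest prototype.
        new_labels = [min(range(k), key=lambda h: dist_sq(x, prototypes[h]))
                      for x in data]
        if new_labels == labels:          # partition is stable: stop
            break
        labels = new_labels
        # Representation step: recompute the prototype of each class.
        for h in range(k):
            members = [x for x, lab in zip(data, labels) if lab == h]
            if members:                   # keep the old prototype if a class empties
                prototypes[h] = prototype(members)
    criterion = sum(dist_sq(x, prototypes[lab]) for x, lab in zip(data, labels))
    return labels, prototypes, criterion


data = [[(0.0, 1.0)], [(0.5, 1.0)], [(10.0, 1.0)], [(10.5, 1.0)]]
labels, prototypes, criterion = dynamic_clustering(data, k=2)
print(labels, prototypes, criterion)
```

Each of the two steps can only lower or preserve the criterion, which is why the loop stops once the partition no longer changes.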

In the present paper, we present an application based on a dynamic clustering of a real-world data set. The data set used in our experiments is the interval temperature dataset shown in Table 1, which was previously used as benchmark interval data for cluster analysis in De Carvalho (2007), Guru and Kiranagi (2005) and Guru et al. (2004).

Table 1. The temperature dataset

We performed a dynamic clustering using as the allocation function the Hausdorff L1 distance, the L2 of De Carvalho et al. (2006), the De Carvalho adaptive distance (De Souza et al. (2004)) and the L2 Wasserstein one alternatively. We chose to obtain a partition into four clusters, and we compared the resulting partition to the a priori one given by experts using the Corrected Rand Index. The expert classification was the following (Guru et al. (2004)): Class 1 (Bahrain, Bombay, Cairo, Calcutta, Colombo, Dubai, Hong Kong, Kuala Lumpur, Madras, Manila, Mexico City, Nairobi, New Delhi, Sydney); Class 2 (Amsterdam, Athens, Copenhagen, Frankfurt, Geneva, Lisbon, London, Madrid, Moscow, Munich, New York, Paris, Rome, San Francisco, Seoul, Stockholm, Tokyo, Toronto, Vienna, Zurich); Class 3 (Mauritius); Class 4 (Tehran).

Using the three different allocation functions, we obtained 3 optimal partitions into 4 clusters (Table 2). On the basis of the dynamic clustering, we evaluated the obtained partitions with respect to the a priori ones using the Corrected Rand Indices (Hubert and Arabie (1985)).
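The Corrected Rand Index of Hubert and Arabie (1985) can be computed from the contingency table of two partitions. The sketch below uses only the standard library; the function name and the label encoding (one label per object, in the same object order for both partitions) are assumptions made for the example.

```python
from collections import Counter
from math import comb


def corrected_rand_index(labels_a, labels_b):
    """Corrected (adjusted) Rand index between two partitions of the same objects."""
    n = len(labels_a)
    cells = Counter(zip(labels_a, labels_b))          # contingency table cells
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / comb(n, 2)       # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:                         # degenerate case, e.g. one class each
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


print(corrected_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, relabelled
print(corrected_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # crossing partitions
```

The index is 1 for identical partitions regardless of how the classes are numbered, and it can fall below 0 when the agreement is worse than expected by chance.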

5 Conclusion and perspectives

Interval descriptions can be derived from measurements subject to error (z ± e). If they are assumed to be (probabilistic) models for the error term, Hausdorff distances are not influenced by the distribution of values, and the L_q distance implicitly considers that all the information is equally concentrated on the bounds of intervals. The Wasserstein distance permits the different position, variability and shape of the compared distributions to be evaluated and taken separately into account, clearing the way for interpreting the results. With a few modifications, it can also be used for the comparison of two fuzzy numbers measured by LR fuzzy variables. Further, being a Euclidean distance, it is easy to show that the Wasserstein distance satisfies the König-Huygens theorem for the decomposition of inertia. This allows us to apply the usual indices based on the comparison between the inter-group and the intra-group inertia for the evaluation and the interpretation of the results of a clustering or of a classification procedure.

Table 2. Clusters obtained using different allocation functions. Last row: Corrected Rand Index (CRI) of the obtained partition compared with the expert partition.

Cluster 1
  L2 Wasserstein: Bahrain, Bombay, Cairo, Calcutta, Colombo, Dubai, Hong Kong, Kuala Lumpur, Madras, Manila, New Delhi
  Adaptive L2: Bahrain, Bombay, Calcutta, Colombo, Dubai, Hong Kong, Kuala Lumpur, Madras, Manila, New Delhi
  Hausdorff L1 distance: Bahrain, Dubai, Hong Kong, New Delhi, Cairo, Mexico City, Nairobi

Cluster 2
  L2 Wasserstein: Amsterdam, Copenhagen, Frankfurt, Geneva, London, Moscow, Munich, Paris, Stockholm, Toronto, Vienna, Zurich
  Adaptive L2: Amsterdam, Copenhagen, Frankfurt, Geneva, London, Moscow, Munich, Paris, Stockholm, Toronto, Vienna
  Hausdorff L1 distance: Amsterdam, Copenhagen, Frankfurt, Geneva, London, Moscow, Munich, Paris, Stockholm, Toronto, Vienna, Zurich

Cluster 4
  L2 Wasserstein: Athens, Lisbon, Madrid, New York, Rome, San Francisco, Seoul, Tehran, Tokyo
  Adaptive L2: Athens, Lisbon, Madrid, New York, Rome, San Francisco, Seoul, Tehran, Tokyo, Zurich
  Hausdorff L1 distance: Athens, Lisbon, Madrid, New York, Rome, San Francisco, Seoul, Tehran, Tokyo

On the other hand, a lot of effort is required for the extension of the distance to the multivariate case. Indeed, here we just proposed an extension (in the sense of Minkowski) of the distance under the hypothesis of independence between the descriptors of a multidimensional interval datum.

References

BARRIO, E., MATRAN, C., RODRIGUEZ-RODRIGUEZ, J. and CUESTA-ALBERTOS, J.A. (1999): Tests of goodness of fit based on the L2-Wasserstein distance. Annals of Statistics, 27, 1230–1239.

COPPI, R., GIL, M.A. and KIERS, H.A.L. (2006): The fuzzy approach to statistical analysis. Computational Statistics and Data Analysis, 51, 1–14.

BOCK, H.H. and DIDAY, E. (2000): Analysis of Symbolic Data, Exploratory Methods for Extracting Statistical Information from Complex Data. Springer-Verlag, Heidelberg.

CHAVENT, M. and LECHEVALLIER, Y. (2002): Dynamical clustering algorithm of interval data: optimization of an adequacy criterion based on Hausdorff distance. In: Sokolowski, A. and Bock, H.H. (Eds.): Classification, Clustering and Data Analysis. Springer, Heidelberg, 53–59.

CHAVENT, M., DE CARVALHO, F.A.T., LECHEVALLIER, Y. and VERDE, R. (2006): New clustering methods for interval data. Computational Statistics, 21, 211–229.

DE CARVALHO, F.A.T. (2007): Fuzzy c-means clustering methods for symbolic interval data. Pattern Recognition Letters, 28, 423–437.

DE CARVALHO, F.A.T., BRITO, P. and BOCK, H. (2006): Dynamic clustering for interval data based on L2 distance. Computational Statistics, 21, 2, 231–250.

DE SOUZA, R.M.C.R. and DE CARVALHO, F. DE A.T. (2004): Clustering of Interval-Valued Data Using Adaptive Squared Euclidean Distances. In: Proc. of ICONIP 2004, 775–780.

DIDAY, E. (1971): La méthode des nuées dynamiques. Rev. Statist. Appl., 19 (2), 19–34.

GIBBS, A.L. and SU, F.E. (2002): On choosing and bounding probability metrics. International Statistical Review, 70, 419.

GURU, D.S. and KIRANAGI, B.B. (2005): Multivalued type dissimilarity measure and concept of mutual dissimilarity value for clustering symbolic patterns. Pattern Recognition, 38, 1, 151–156.

GURU, D.S., KIRANAGI, B.B. and NAGABHUSHAN, P. (2004): Multivalued type proximity measure and concept of mutual similarity value useful for clustering symbolic patterns. Pattern Recognition Letters, 25, 10, 1203–1213.

HUBERT, L. and ARABIE, P. (1985): Comparing partitions. Journal of Classification, 2, 193–218.

IRPINO, A. and ROMANO, E. (2007): Optimal histogram representation of large data sets: Fisher vs piecewise linear approximations. Revue des Nouvelles Technologies de l'Information, RNTI-E-9, 99–110.

TRAN, L. and DUCKSTEIN, L. (2002): Comparison of fuzzy numbers using a fuzzy distance measure. Fuzzy Sets and Systems, 130, 331–341.

VERDE, R. and LAURO, N. (2000): Basic choices and algorithms for symbolic objects dynamical clustering. In: XXXIIe Journées de Statistique, Fès, Maroc, Société Française de Statistique, 38–42.


Automatic Analysis of Dewey Decimal Classification Notations

Ulrike Reiner

Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)

37077 Göttingen, Germany

ulrike.reiner@gbv.de

Abstract. The Dewey Decimal Classification (DDC) was conceived by Melvil Dewey in 1873 and published in 1876. Nowadays, the DDC serves as a library classification system in about 138 countries worldwide. Recently, the German translation of the DDC was launched, and since then the interest in DDC has rapidly increased in German-speaking countries. The complex DDC system (Ed. 22) allows one to synthesize (to build) a huge number of DDC notations (numbers) with the aid of instructions. Since the meaning of built DDC numbers is not obvious, especially to non-DDC experts, a computer program has been written that automatically analyzes DDC numbers. Based on Songqiao Liu's dissertation (Liu (1993)), our program decomposes DDC notations from the main class 700 (as one of the ten main classes). In addition, our program analyzes notations from all ten classes and determines the meaning of every semantic atom contained in a built DDC notation. The extracted DDC atoms can be used for information retrieval, automatic classification, or other purposes.

1 Introduction

While searching for books, journals, or web resources, you will often come across numbers such as "025.1740973", "016.02092", or "720.7073". What do they mean? Library professionals will identify these strings as numbers (notations) of the Dewey Decimal Classification (DDC), which is named after its creator, Melvil Dewey. Originally, Dewey designed the classification for libraries, but in the meantime DDC has also been discovered for classifying the web or other resources. The DDC is used, among other reasons, because it has a long-standing tradition and is still up to date: in order to cope with scientific progress, it is currently under development by a ten-member international board (the Editorial Policy Committee, EPC). While the first edition, which was published in 1876, only comprised a few pages, the current 22nd edition of the DDC spans a four-volume work with almost 4,000 pages. Today, the DDC contains approx. 48,000 DDC notations and about 8,000 instructions. The DDC notations are enumerated in the schedules and tables of the DDC. With the aid of the instructions mentioned above, human classifiers can build new synthesized notations (numbers) if these are not specifically listed in the DDC schedules. This way, an enormous number of synthesized DDC notations has been built intellectually over […]

2 DDC notations

Notations play an important role in the DDC:

"Notation is the system of symbols used to represent the classes in a classification system. The notation provides a universal language to identify the class and related classes, regardless of the fact that different words or languages may be used to describe the class." (http://www.oclc.org/dewey/versions/ddc22/intro.pdf)

The following picture serves as an example for the aforesaid. Class C is represented by the notation 025.43 or, respectively, by the captions of three different languages:

Fig. 1. Class C represented by notation 025.43 or by several captions

In compliance with the DDC system, the automatic analysis of notations of the DDC is carried out in the VZG (VerbundZentrale des Gemeinsamen Bibliotheksverbundes) project Colibri (COntext generation and LInguistic tools for Bibliographic Retrieval Interfaces). The goal of this project is to enrich title records on the basis of the DDC to improve retrieval. The analysis of DDC notations is conducted under the following research questions (which are also posed in a similar way in Liu (1993), p. 18):

Q1. Is it possible to automatically decompose molecular DDC notations into atomic DDC notations?

Q2. Is it possible to improve automatic classification and retrieval by means of atomic DDC notations?

An atomic DDC notation is a semantically indecomposable string (of symbols) that represents a DDC class. A molecular DDC notation is a string that is syntactically decomposable into atomic DDC notations.

DDC notations can be found at several places in the DDC. In DDC summaries, the notations for the main classes (or tens), the divisions (or hundreds), and the sections (or thousands) are enumerated. Other notations are listed in the schedules ("DDC schedule notations") or tables ("DDC table notations") or internal tables. DDC schedules are "the series of DDC numbers 000-999, their headings (captions), and notes." (Mitchell (1996), p. lxv) A DDC table is "a table of numbers that may be added to other numbers to make a class number appropriately specific to the work being classified" (Mitchell (1996), p. lxv). Further notations are contained in the "Relative Index" of the DDC.

The frequency distributions of schedule (table) notations are shown in Fig. 2 (Fig. 3), where schedno0 is shorthand for DDC schedule notations beginning with 0, schedno1 for DDC schedule notations beginning with 1, etc. The captions for the main classes are: 000: Computer science, information & general works; 100: Philosophy & psychology; 200: Religion; 300: Social sciences; 400: Language; 500: Science; 600: Technology; 700: Arts & recreation; 800: Literature; 900: History & geography. As illustrated by Fig. 2, DDC notations are not distributed uniformly: most schedule notations can be found in the class "Technology", followed by the notations in the class "Social sciences". The fewest notations belong to the class "Philosophy & psychology". With regard to the table notations (Fig. 3), the 7,816 Table 2 notations ("Geographic Areas, Historical Periods, Persons") stand out, whereas, in contrast, the quantities of all other table notations are comparatively small (Table 1: Standard Subdivisions; Table 3: Subdivisions for the Arts, for Individual Literatures, for Specific Literary Forms; Table 4: Subdivisions of Individual Languages and Language Families; Table 5: Ethnic and National Groups; Table 6: Languages).

As mentioned before, DDC notations that are not explicitly listed in the schedules can be built by using DDC instructions. This process is called "notational synthesis" or "number building". Its results are synthesized DDC notations (molecular DDC notations) that usually only DDC experts are able to interpret. But with the aid of our computer program "DDC analyzer", the meaning of molecular DDC notations is revealed and the determined atomic DDC notations can be used, among other things, to answer question Q2.

3 Automatic analysis of DDC notations

The GBV Union Catalog GVK (Gemeinsamer VerbundKatalog, http://gso.gbv.de/) contains 3,073,423 intellectually DDC-classified title records (status: July 2004). After the automatic elimination of segmentation marks, obviously incorrect DDC notations (3.8 per cent of all DDC notations), and duplicate DDC notations, a total of 466,134 different DDC notations is available for the automatic analysis of DDC notations. This set of all GVK DDC notations serves as input data for the DDC analyzer. The frequency of DDC schedule notations is as follows (in descending order): those beginning with 3 (189,246), with 9 (62,115), with 7 (52,632), with 6 (51,704), with 5 (33,649), with 0 (23,946), with 2 (20,888), with 8 (20,678), with 4 (6,680), and with 1 (4,596). The arity of DDC notations of all GVK DDC notations is Gaussian distributed with a maximum at 10, i.e. most DDC notations have approx. arity 10, the shortest DDC notation has arity 1, and the longest DDC notation has arity 29. Other important input data for the DDC analyzer we used were the 600 DDC numbers given in Liu's dissertation. These 600 DDC numbers, which we call "Liu's sample", were randomly selected from class 700 from the OCLC database by Liu.

Fig. 2. Frequency distribution of DDC schedule notations

Fig. 3. Frequency distribution of DDC table notations
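The preparation of the input set and the tally by leading digit can be sketched as follows. The reported counts are copied from the text; everything else is an assumption made for illustration: the function names, the choice of "/" and "'" as segmentation marks, and the tiny sample list (the filter for obviously incorrect notations is not modelled).

```python
from collections import Counter

# Counts of schedule notations per leading digit, as reported in the text.
REPORTED = {"3": 189_246, "9": 62_115, "7": 52_632, "6": 51_704, "5": 33_649,
            "0": 23_946, "2": 20_888, "8": 20_678, "4": 6_680, "1": 4_596}


def clean(notations, marks="/'"):
    """Yield notations without segmentation marks, skipping duplicates.

    The mark characters are an assumption, not taken from the source.
    """
    seen = set()
    for raw in notations:
        notation = "".join(ch for ch in raw.strip() if ch not in marks)
        if notation and notation not in seen:
            seen.add(notation)
            yield notation


def by_leading_digit(notations):
    """Count notations by their first digit."""
    return Counter(notation[0] for notation in notations)


sample = ["025.1740973", "016.02092", "720/.7073", "720.7073"]
print(by_leading_digit(clean(sample)))
print(sum(REPORTED.values()))  # total of the reported per-digit counts
```

The per-digit counts given in the text add up to the stated total of 466,134 different notations.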

As a member of the Consortium DDC German, we have access to the machine-readable data of the 22nd edition of the DDC system. These data are stored in an xml file. The English electronic web version is available as WebDewey (http://connexion.oclc.org/), the German counterpart as MelvilClass (http://services.ddc-deutsch.de/melvilclass-login). For our purpose, only the relevant data of the xml file, which contains the expert knowledge of the DDC system, are extracted and stored in a "knowledge base". Here, DDC notations, descriptors, and descriptor values are stored in consecutive fields, while facts and rules, as we call them, are represented in a very similar way:

[…] "Electronic resources"

'#' serves as field separator. The xml tags that are given in angle brackets stand for: "ba4" ("beginning of add table (all of table number)"), "na1" ("add note (part of schedule number)") and "hat" ("hierarchy at class"). "r1" and "r2", which follow "na1" or, respectively, "ba4", stand for the first two macro rules. The knowledge base contains 48,067 facts and 8,033 rules. The 8,033 rules can be generalized to macro rules. While Liu (1993) defined 17 (macro) rules for the decomposition for class 700, we defined 25 macro rules for all DDC classes.

Our program, the DDC analyzer, works as follows: after initializing variables, it reads the knowledge base and, triggered by one or more DDC notations to be analyzed, executes the analysis algorithm. The number of correct and incorrect DDC notations is counted. For a DDC notation, the analysis has two phases: determining the facts from left to right (phase 1) and determining the facts via rules from left to right (phase 2). After checking which output format has to be printed, the result is printed as a DDC analysis diagram or as a DDC analysis result set. After all DDC notations have been analyzed, the number of totally/partially analyzed DDC notations is printed. There are different reasons for a partially analyzed DDC notation: either the implementation of the DDC analyzer is incorrect/incomplete, or the DDC notation is incorrectly synthesized, or a part of the DDC system itself is incorrect.
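The two phases can be illustrated on the notation 720.7073 with a toy knowledge base. This is a heavily simplified sketch and not the DDC analyzer itself: the captions are those quoted in this chapter, while the dictionary layout, the function names, the single hard-coded rule (Table 1 notation 07 followed by 0 and a Table 2 area) and the treatment of trailing zeros as padding are assumptions made for the example.

```python
# Toy knowledge base. Captions come from the text; the layout is hypothetical.
SCHEDULE_FACTS = {"7": "Arts & recreation", "72": "Architecture"}
TABLE1_FACTS = {"07": "Education, research, related topics"}
TABLE2_FACTS = {"7": "North America", "73": "United States"}


def scan_facts(digits, facts):
    """Walk left to right and collect every prefix that is a known fact."""
    return [(digits[:end], facts[digits[:end]])
            for end in range(1, len(digits) + 1)
            if digits[:end] in facts]


def analyze(notation):
    """Return (atoms, rest); a non-empty rest marks a partially analyzed notation."""
    digits = notation.replace(".", "")             # drop the Dewey dot
    atoms = scan_facts(digits, SCHEDULE_FACTS)     # phase 1: facts only
    base = atoms[-1][0] if atoms else ""
    rest = digits[len(base):]
    # Phase 2 stand-in: one hard-coded rule instead of the real macro rules.
    if rest.startswith("070"):
        atoms.append(("T1--07", TABLE1_FACTS["07"]))
        atoms += [("T2--" + d, c) for d, c in scan_facts(rest[3:], TABLE2_FACTS)]
        rest = ""
    return atoms, rest.rstrip("0")                 # trailing zeros treated as padding


atoms, rest = analyze("720.7073")
for notation, caption in atoms:
    print(notation, caption)
```

A real analyzer would select the applicable rule from the knowledge base rather than hard-coding it, and would report the unanalyzed remainder for partially analyzed notations.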


The title of this book is:

#aVoices in architectural education: #bcultural politics and

The subject headings for this book are:

#aArchitecture #xStudy and teaching #zUnited States

#aArchitecture and state #zUnited States

.-7- North America <na4r7span:T1–0701-T1–0709:T2–7>

.-73 United States <na4r7span:T1–0701-T1–0709:T2–73>

The information given in angle brackets should be read as follows: "hatzen" is the concatenation of "hat" ("hierarchy at class") and "zen" ("zen built entry (main tag)"). "T1–" stands for "table 1", "T2–" for "table 2", "na4" for "add note (add of table number)", "r7" for "macro rule 7", "span" for "span of numbers", and ":" for "delimiter". As you can see, while Liu decomposes the synthesized DDC notation into three chunks, our DDC analysis diagram shows the finest possible analysis of the molecular DDC notation. The fine analysis provides the advantage of uncovering additional captions: "Arts & recreation", "Architecture", "North America", and "Education, research, related topics".

A DDC analysis diagram contains analysis and synthesis information: (1) the molecular DDC notation to be analyzed; (2) an identifier (name) and the length of the molecular DDC notation; (3) the sequence and position of the digits within the molecular DDC notation; (4) the Dewey dot at position 4; (5) the relevant parts of the molecular DDC notation for each analysis step; (6) the corresponding caption for every atomic DDC notation; (7) the parts irrelevant for the respective analysis step, marked with "-"; (8) the type of the applied facts and rules, which appear in angle brackets. Once the reader knows how to interpret the information mentioned in (8), every synthesis step can be reproduced. While DDC analysis diagrams are intended for human experts, the DDC analysis result set can be used for data transfer. […]
