A New Interval Data Distance Based on the Wasserstein Metric
In practice, they consider the expected value of the distance between all the points belonging to interval A and all the points belonging to interval B. In their paper, they state that it is a distance, but it is easy to see that it does not satisfy the first of the properties mentioned above. Indeed, the distance of an interval from itself is equal to zero only if the interval is thin:

$$ d_{TD}(A,A) = \left[\frac{a+b}{2} - \frac{a+b}{2}\right]^2 + \frac{1}{3}\left[\left(\frac{b-a}{2}\right)^2 + \left(\frac{b-a}{2}\right)^2\right] = \frac{2}{3}\left(\frac{b-a}{2}\right)^2 \ge 0 \qquad (2) $$
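A quick numeric check of (2) can be sketched as follows. The two-interval form used in `d_td` is an assumption inferred from the self-distance in (2), not a formula quoted from the text, and the function name is purely illustrative.

```python
def d_td(a, b, u, v):
    """Assumed Tran-Duckstein-type distance between [a, b] and [u, v].

    The general form is inferred from equation (2); it is an assumption.
    """
    mid_term = ((a + b) / 2 - (u + v) / 2) ** 2
    spread_term = (((b - a) / 2) ** 2 + ((v - u) / 2) ** 2) / 3
    return mid_term + spread_term


# Self-distance of a thin interval: both terms vanish.
thin = d_td(3.0, 3.0, 3.0, 3.0)
# Self-distance of a proper interval: (2/3) * ((b - a) / 2) ** 2, which is positive.
wide = d_td(1.0, 5.0, 1.0, 5.0)
print(thin, wide)
```

Under this assumed form, the self-distance grows with the squared radius of the interval, which is exactly why the first property fails for non-thin intervals.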
Hausdorff-based distances. The most common distance used for the comparison of two sets is the Hausdorff distance². Considering two sets A and B of points of R^n, and a distance d(x,y) where x ∈ A and y ∈ B, the Hausdorff distance is defined as follows:

$$ d_H(A,B) = \max\left( \sup_{x \in A} \inf_{y \in B} d(x,y),\ \sup_{y \in B} \inf_{x \in A} d(x,y) \right) $$
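If d(x,y) is the L1 city-block distance, then Chavent et al. (2002) proved that, for two intervals A = [a, b] and B = [u, v],

$$ d_H(A,B) = \max(|a-u|, |b-v|) = \left|\frac{a+b}{2} - \frac{u+v}{2}\right| + \left|\frac{b-a}{2} - \frac{v-u}{2}\right| $$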
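The bound-based expression can be checked against a midpoint-radius form with a few lines of code. The sketch below assumes intervals A = [a, b] and B = [u, v]; the midpoint-radius form is an elementary identity included here as an assumption rather than a quotation, and the function names are illustrative.

```python
def hausdorff_bounds(a, b, u, v):
    """Hausdorff distance between [a, b] and [u, v] computed from the bounds."""
    return max(abs(a - u), abs(b - v))


def hausdorff_mid_rad(a, b, u, v):
    """Same quantity written with midpoints and radii (assumed identity)."""
    m_a, r_a = (a + b) / 2, (b - a) / 2
    m_b, r_b = (u + v) / 2, (v - u) / 2
    return abs(m_a - m_b) + abs(r_a - r_b)


# Two example intervals; both functions should agree.
print(hausdorff_bounds(1.0, 5.0, 2.0, 9.0), hausdorff_mid_rad(1.0, 5.0, 2.0, 9.0))
print(hausdorff_bounds(0.0, 1.0, -3.0, 10.0), hausdorff_mid_rad(0.0, 1.0, -3.0, 10.0))
```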
L_q distances between the bounds of intervals. A family of distances between intervals has been proposed by De Carvalho et al. (2006). Considering a set of interval data described in a space R^p, the metric of norm q is defined as:

$$ d_{L_q}(A,B) = \left[ \sum_{j=1}^{p} \left( |a_j - u_j|^q + |b_j - v_j|^q \right) \right]^{1/q} $$

They also showed that if the norm is L_∞, then d_{L_∞} = d_H (in the L1 norm).
The same measure was extended (De Carvalho (2007)) to an adaptive one in order to take into account the variability of the different clusters in a dynamical clustering process.
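The behaviour of the L_q family can be illustrated numerically. The sketch below assumes that each object is a list of p intervals given as (lower, upper) pairs; the function names are illustrative, and the adaptive variant is not covered.

```python
def d_lq(A, B, q):
    """L_q distance between two lists of intervals [(a_j, b_j)] and [(u_j, v_j)]."""
    total = sum(abs(a - u) ** q + abs(b - v) ** q
                for (a, b), (u, v) in zip(A, B))
    return total ** (1 / q)


def d_linf(A, B):
    """Limit case q -> infinity: the largest absolute difference between bounds."""
    return max(max(abs(a - u), abs(b - v)) for (a, b), (u, v) in zip(A, B))


A = [(1.0, 5.0)]
B = [(2.0, 9.0)]
print(d_lq(A, B, 2))    # Euclidean distance between the bounds
print(d_lq(A, B, 50))   # already very close to the limit value
print(d_linf(A, B))     # for p = 1 this is max(|a - u|, |b - v|)
```

For a single interval variable the limit value coincides with the bound-based Hausdorff expression max(|a − u|, |b − v|).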
3 Our proposal: Wasserstein distance
If we suppose a uniform distribution of points, an interval of reals A (t) = [a, b] can
be expressed as the following type of function:
² The name is related to Felix Hausdorff, who is well known for the separability theorem on topological spaces at the end of the 19th century.
Rosanna Verde and Antonio Irpino
If we consider a description of the interval by means of its midpoint m and radius r,
the same function can be rewritten as follows:
Then, the squared Euclidean distance between homologous points of two intervals A = [a, b] and B = [u, v], or described by the midpoint-radius notation A = (m_A, r_A) and B = (m_B, r_B), is defined as follows:
... uniform density functions U(a,b) and U(u,v). In this way, we may use the Monge-Kantorovich-Wasserstein-Gini metric (Gibbs and Su (2002)). Let $F$ be a distribution function and $F^{-1}$ the corresponding quantile function. Given two univariate random variables $A$ and $B$, with distribution functions $F_A$ and $F_B$, the Wasserstein-Kantorovich distance is defined as:

$$ d_W(A,B) = \sqrt{\int_0^1 \left( F_A^{-1}(t) - F_B^{-1}(t) \right)^2 dt} \qquad (10) $$

Denoting $\mu_A = E(A)$ (resp. $\mu_B = E(B)$), $\sigma_A = \sqrt{VAR(A)}$ (resp. $\sigma_B = \sqrt{VAR(B)}$) and $Corr_{QQ}$ as the correlation of the quantiles of $F_A$ and $F_B$, Irpino and Romano (2007) proved that the square of (10) can be decomposed as:

$$ d_W^2(A,B) = (\mu_A - \mu_B)^2 + (\sigma_A - \sigma_B)^2 + 2\sigma_A\sigma_B \left[ 1 - Corr_{QQ}(F_A, F_B) \right] \qquad (12) $$

The proposed decomposition separates the contributions to the distance that are generated by the different location, the different size and the different shape of the two densities.
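For two uniform densities the decomposition can be verified numerically. The sketch below rests on three assumptions that are not spelled out in the text: the distance is the usual L2 Wasserstein distance in quantile form, the size terms in (12) are standard deviations, and the quantile correlation of two non-degenerate uniforms is 1 because both quantile functions are linear in t. All function names are illustrative.

```python
import math


def quantile(a, b, t):
    """Quantile function of the uniform distribution on [a, b]."""
    return a + t * (b - a)


def w2_numeric(a, b, u, v, n=10_000):
    """Midpoint-rule approximation of the squared L2 Wasserstein distance."""
    h = 1.0 / n
    return h * sum(
        (quantile(a, b, (i + 0.5) * h) - quantile(u, v, (i + 0.5) * h)) ** 2
        for i in range(n)
    )


def w2_closed(a, b, u, v):
    """Closed form in midpoint-radius notation: (m_A - m_B)^2 + (r_A - r_B)^2 / 3."""
    m_a, r_a = (a + b) / 2, (b - a) / 2
    m_b, r_b = (u + v) / 2, (v - u) / 2
    return (m_a - m_b) ** 2 + (r_a - r_b) ** 2 / 3


def w2_decomposed(a, b, u, v):
    """Location + size + shape terms; the shape term vanishes when Corr_QQ = 1."""
    mu_a, mu_b = (a + b) / 2, (u + v) / 2
    sd_a, sd_b = (b - a) / math.sqrt(12), (v - u) / math.sqrt(12)
    corr_qq = 1.0  # assumption: both quantile functions are linear in t
    return (mu_a - mu_b) ** 2 + (sd_a - sd_b) ** 2 + 2 * sd_a * sd_b * (1 - corr_qq)


print(w2_numeric(1, 5, 2, 9), w2_closed(1, 5, 2, 9), w2_decomposed(1, 5, 2, 9))
```

Unlike the self-distance in (2), this quantity is zero whenever the two intervals coincide, whatever their width.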
In order to calculate the distance between two elements described by p interval variables, we propose the following extension of the distance to the multivariate case in the sense of Minkowski:

$$ d_W(A,B) = \sqrt{\sum_{j=1}^{p} d_W^2(A_j, B_j)} $$
... the clusters by means of prototypes (Chavent et al. (2006)). In the literature, several authors indicate how to compute prototypes. In particular, Verde and Lauro (2000) proposed that the prototype of a cluster should be an element having the same properties as the clustered elements. In this way, a cluster of intervals is described by a single prototypal interval, in the same way as a cluster of points is represented by its barycenter.
Let E be a set of n data described by p interval variables X_j (j = 1, ..., p). The general DCA looks for the partition P ∈ P_k of E in k classes, among all the possible partitions P_k, and the vector L ∈ L_k of k prototypes representing the classes in P, such that the following fitting criterion Δ between L and P is minimized:

$$ \Delta(P^*, L^*) = \min\{ \Delta(P,L) \mid P \in P_k,\ L \in L_k \}. \qquad (14) $$

Such a criterion is defined as the sum of dissimilarity or distance measures δ(x_i, G_h) of fitting between each object x_i belonging to a class C_h ∈ P and the class representation G_h ∈ L:

$$ \Delta(P,L) = \sum_{h=1}^{k} \sum_{x_i \in C_h} \delta(x_i, G_h) $$
A prototype G_h associated to a class C_h is an element of the space of the description of E, and it can be represented as a vector of intervals. The algorithm is initialized by generating k random clusters or, alternatively, k random prototypes. Generally, the criterion Δ(P,L) is based on an additive distance on the p descriptors.
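The alternation between allocation and representation can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: each object is a list of (midpoint, radius) pairs, the fitting measure is the squared Wasserstein distance between uniform intervals summed over the p variables, and the prototype is taken as the per-variable mean midpoint and mean radius, which minimizes that sum. The initial prototypes are fixed rather than random so that the run is reproducible.

```python
def w2(x, g):
    """Squared Wasserstein distance between uniform intervals, summed over variables."""
    return sum((mx - mg) ** 2 + (rx - rg) ** 2 / 3
               for (mx, rx), (mg, rg) in zip(x, g))


def prototype(members):
    """Per-variable mean midpoint and mean radius of the cluster members."""
    n, p = len(members), len(members[0])
    return [(sum(x[j][0] for x in members) / n,
             sum(x[j][1] for x in members) / n) for j in range(p)]


def dynamic_clustering(data, initial_prototypes, max_iter=100):
    prototypes = list(initial_prototypes)  # do not mutate the caller's list
    labels = None
    for _ in range(max_iter):
        # Allocation step: assign each object to its closest prototype.
        new_labels = [min(range(len(prototypes)), key=lambda h: w2(x, prototypes[h]))
                      for x in data]
        if new_labels == labels:
            break  # the partition is stable, so the criterion cannot decrease further
        labels = new_labels
        # Representation step: recompute the prototype of every non-empty class.
        for h in range(len(prototypes)):
            members = [x for x, lab in zip(data, labels) if lab == h]
            if members:
                prototypes[h] = prototype(members)
    criterion = sum(w2(x, prototypes[lab]) for x, lab in zip(data, labels))
    return labels, prototypes, criterion


# Toy data: four objects, one interval variable each, as (midpoint, radius).
data = [[(1.0, 0.5)], [(1.2, 0.4)], [(10.0, 2.0)], [(11.0, 2.5)]]
labels, prototypes, criterion = dynamic_clustering(data, [data[0], data[2]])
print(labels, prototypes, criterion)
```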
In the present paper, we present an application based on a dynamic clustering of a real-world data set. The data set used in our experiments is the interval temperature dataset shown in Table 1, which was previously used as a benchmark interval data set for cluster analysis in De Carvalho (2007), Guru and Kiranagi (2005) and Guru et al. (2004). We performed a dynamic clustering using as the allocation function, in turn, the Hausdorff L1 distance, the L2 distance of De Carvalho et al. (2006), the De Carvalho adaptive distance (De Souza et al. (2004)) and the L2 Wasserstein distance. We chose to obtain a partition into four clusters, and we compared the resulting partition with the a priori one given by experts using the Corrected Rand Index. The expert classification was as follows (Guru et al. (2004)): Class 1 (Bahrain, Bombay, Cairo, Calcutta, Colombo, Dubai, Hong Kong, Kuala Lumpur, Madras, Manila, Mexico, Nairobi, New Delhi, Sydney); Class 2 (Amsterdam, Athens, Copenhagen, Frankfurt, Geneva, Lisbon, London, Madrid, Moscow, Munich, New York, Paris, Rome, San Francisco, Seoul, Stockholm, Tokyo, Toronto, Vienna, Zurich); Class 3 (Mauritius); Class 4 (Tehran).

Table 1. The temperature dataset

Using the three different allocation functions, we obtained 3 optimal partitions into 4 clusters (Table 2). On the basis of the dynamic clustering, we evaluated the obtained partitions with respect to the a priori one using the Corrected Rand Index (Hubert and Arabie (1985)).
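For readers who want to reproduce this kind of comparison, the Corrected Rand Index of Hubert and Arabie (1985) can be computed from the contingency table of two partitions. The sketch below is a generic implementation using only the standard library; the label vectors are made-up toy data, not the partitions of the temperature dataset.

```python
from collections import Counter
from math import comb


def corrected_rand_index(labels_a, labels_b):
    """Corrected (adjusted) Rand index between two partitions of the same objects."""
    n = len(labels_a)
    # Pairs of objects that are together in both partitions.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of objects that are together in each partition separately.
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions have a single class
    return (sum_cells - expected) / (max_index - expected)


# Identical partitions up to a relabelling of the classes.
same = corrected_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# One class of the first partition is split in the second one.
split = corrected_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
print(same, split)
```

The index equals 1 for identical partitions and is close to 0 when the agreement is no better than chance.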
5 Conclusion and perspectives
Interval descriptions can be derived from measurements subject to error (z ± e). If they are assumed to be (probabilistic) models for the error term, Hausdorff distances are not influenced by the distribution of values, and the L_q distance implicitly assumes that all the information is equally concentrated on the bounds of the intervals. The Wasserstein distance permits the different position, variability and shape of the compared distributions to be evaluated and taken into account separately, clearing the way for the interpretation of the results. With a few modifications, it can also be used for the comparison of two fuzzy numbers measured by LR fuzzy variables. Further, being a Euclidean distance, it is easy to show that the Wasserstein distance satisfies the König-Huygens theorem for the decomposition of inertia. This allows us to apply the usual indices based on the comparison between the inter-group and the intra-group inertia for the evaluation and the interpretation of the results of a clustering or of a classification procedure.
Table 2. Clusters obtained using different allocation functions. Last row: Corrected Rand Index (CRI) of the obtained partition compared with the expert partition.

Cluster 1
  L2 Wasserstein: Bahrain, Bombay, Cairo, Calcutta, Colombo, Dubai, HongKong, KulaLumpur, Madras, Manila, NewDelhi
  Adaptive L2: Bahrain, Bombay, Calcutta, Colombo, Dubai, HongKong, KulaLumpur, Madras, Manila, NewDelhi
  Hausdorff L1 distance: Bahrain, Dubai, HongKong, NewDelhi, Cairo, MexicoCity, Nairobi
Cluster 2
  L2 Wasserstein: Amsterdam, Copenhagen, Frankfurt, Geneva, London, Moscow, Munich, Paris, Stockholm, Toronto, Vienna, Zurich
  Adaptive L2: Amsterdam, Copenhagen, Frankfurt, Geneva, London, Moscow, Munich, Paris, Stockholm, Toronto, Vienna
  Hausdorff L1 distance: Amsterdam, Copenhagen, Frankfurt, Geneva, London, Moscow, Munich, Paris, Stockholm, Toronto, Vienna, Zurich
Cluster 4
  L2 Wasserstein: Athens, Lisbon, Madrid, New York, Rome, SanFrancisco, Seoul, Tehran, Tokyo
  Adaptive L2: Athens, Lisbon, Madrid, New York, Rome, SanFrancisco, Seoul, Tehran, Tokyo, Zurich
  Hausdorff L1 distance: Athens, Lisbon, Madrid, NewYork, Rome, SanFrancisco, Seoul, Tehran, Tokyo
On the other hand, a lot of effort is still required for the extension of the distance to the multivariate case. Indeed, here we have only proposed an extension (in the sense of Minkowski) of the distance under the hypothesis of independence between the descriptors of a multidimensional interval datum.
References
BARRIO, E., MATRAN, C., RODRIGUEZ-RODRIGUEZ, J. and CUESTA-ALBERTOS, J.A. (1999): Tests of goodness of fit based on the L2-Wasserstein distance. Annals of Statistics, 27, 1230–1239.
COPPI, R., GIL, M.A. and KIERS, H.A.L. (2006): The fuzzy approach to statistical analysis. Computational Statistics and Data Analysis, 51, 1–14.
BOCK, H.H. and DIDAY, E. (2000): Analysis of Symbolic Data, Exploratory Methods for Extracting Statistical Information from Complex Data. Springer-Verlag, Heidelberg.
CHAVENT, M. and LECHEVALLIER, Y. (2002): Dynamical clustering algorithm of interval data: optimization of an adequacy criterion based on Hausdorff distance. In: Sokokowsky, A., Bock, H.H. (Eds.): Classification, Clustering and Data Analysis. Springer, Heidelberg, 53–59.
CHAVENT, M., DE CARVALHO, F.A.T., LECHEVALLIER, Y. and VERDE, R. (2006): New clustering methods for interval data. Computational Statistics, 21, 211–229.
DE CARVALHO, F.A.T. (2007): Fuzzy c-means clustering methods for symbolic interval data. Pattern Recognition Letters, 28, 423–437.
DE CARVALHO, F.A.T., BRITO, P. and BOCK, H. (2006): Dynamic clustering for interval data based on L2 distance. Computational Statistics, 21, 2, 231–250.
DE SOUZA, R.M.C.R. and DE CARVALHO, F. DE A.T. (2004): Clustering of Interval-Valued Data Using Adaptive Squared Euclidean Distances. In: Proc. of ICONIP 2004, 775–780.
DIDAY, E. (1971): La méthode des Nuées dynamiques. Rev. Statist. Appl., 19 (2), 19–34.
GIBBS, A.L. and SU, F.E. (2002): On choosing and bounding probability metrics. International Statistical Review, 70, 419.
GURU, D.S. and KIRANAGI, B.B. (2005): Multivalued type dissimilarity measure and concept of mutual dissimilarity value for clustering symbolic patterns. Pattern Recognition, 38, 1, 151–156.
GURU, D.S., KIRANAGI, B.B. and NAGABHUSHAN, P. (2004): Multivalued type proximity measure and concept of mutual similarity value useful for clustering symbolic patterns. Pattern Recognition Letters, 25, 10, 1203–1213.
HUBERT, L. and ARABIE, P. (1985): Comparing partitions. Journal of Classification, 2, 193–218.
IRPINO, A. and ROMANO, E. (2007): Optimal histogram representation of large data sets: Fisher vs piecewise linear approximations. Revue des Nouvelles Technologies de l’Information, RNTI-E-9, 99–110.
TRAN, L. and DUCKSTEIN, L. (2002): Comparison of fuzzy numbers using a fuzzy distance measure. Fuzzy Sets and Systems, 130, 331–341.
VERDE, R. and LAURO, N. (2000): Basic choices and algorithms for symbolic objects dynamical clustering. In: XXXIIe Journées de Statistique, Fès, Maroc, Société Française de Statistique, 38–42.
Automatic Analysis of Dewey Decimal Classification Notations
Ulrike Reiner
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
37077 Göttingen, Germany
ulrike.reiner@gbv.de
Abstract. The Dewey Decimal Classification (DDC) was conceived by Melvil Dewey in 1873 and published in 1876. Nowadays, the DDC serves as a library classification system in about 138 countries worldwide. Recently, the German translation of the DDC was launched, and since then the interest in DDC has rapidly increased in German-speaking countries. The complex DDC system (Ed. 22) makes it possible to synthesize (build) a huge number of DDC notations (numbers) with the aid of instructions. Since the meaning of built DDC numbers is not obvious, especially to non-DDC experts, a computer program has been written that automatically analyzes DDC numbers. Based on Songqiao Liu’s dissertation (Liu (1993)), our program decomposes DDC notations from the main class 700 (as one of the ten main classes). In addition, our program analyzes notations from all ten classes and determines the meaning of every semantic atom contained in a built DDC notation. The extracted DDC atoms can be used for information retrieval, automatic classification, or other purposes.
1 Introduction
While searching for books, journals, or web resources, you will often come across numbers such as "025.1740973", "016.02092", or "720.7073". What do they mean? Library professionals will identify these strings as numbers (notations) of the Dewey Decimal Classification (DDC), which is named after its creator, Melvil Dewey. Originally, Dewey designed the classification for libraries, but in the meantime the DDC has also been discovered for classifying the web or other resources. The DDC is used, among other reasons, because it has a long-standing tradition and is still up to date: in order to cope with scientific progress, it is currently under development by a ten-member international board (the Editorial Policy Committee, EPC). While the first edition, which was published in 1876, only comprised a few pages, the current 22nd edition of the DDC spans a four-volume work with almost 4,000 pages. Today, the DDC contains approx. 48,000 DDC notations and about 8,000 instructions. The DDC notations are enumerated in the schedules and tables of the DDC. With the aid of the instructions mentioned above, human classifiers can build new synthesized notations (numbers) if these are not specifically listed in the DDC schedules. This way, an enormous number of synthesized DDC notations has been built intellectually over ...

2 DDC notations
Notations play an important role in the DDC:
"Notation is the system of symbols used to represent the classes in a classification system. The notation provides a universal language to identify the class and related classes, regardless of the fact that different words or languages may be used to describe the class." (http://www.oclc.org/dewey/versions/ddc22/intro.pdf)

The following picture serves as an example of the above. Class C is represented by the notation 025.43 or, respectively, by the captions in three different languages:

Fig. 1. Class C represented by notation 025.43 or by several captions
In compliance with the DDC system, the automatic analysis of DDC notations is carried out in the VZG (VerbundZentrale des Gemeinsamen Bibliotheksverbundes) project Colibri (COntext generation and LInguistic tools for Bibliographic Retrieval Interfaces). The goal of this project is to enrich title records on the basis of the DDC to improve retrieval. The analysis of DDC notations is conducted under the following research questions (which are also posed in a similar way in Liu (1993), p. 18): Q1. Is it possible to automatically decompose molecular DDC notations into atomic DDC notations? Q2. Is it possible to improve automatic classification and retrieval by means of atomic DDC notations? An atomic DDC notation is a semantically indecomposable string (of symbols) that represents a DDC class. A molecular DDC notation is a string that is syntactically decomposable into atomic DDC notations.
DDC notations can be found in several places in the DDC. In the DDC summaries, the notations for the main classes (or tens), the divisions (or hundreds), and the sections (or thousands) are enumerated. Other notations are listed in the schedules ("DDC schedule notations"), tables ("DDC table notations"), or internal tables. DDC schedules are "the series of DDC numbers 000-999, their headings (captions), and notes." (Mitchell (1996), p. lxv). A DDC table is "a table of numbers that may be added to other numbers to make a class number appropriately specific to the work being classified" (Mitchell (1996), p. lxv). Further notations are contained in the "Relative Index" of the DDC. The frequency distributions of schedule (table) notations are shown in Fig. 2 (Fig. 3), where schedno0 is shorthand for DDC schedule notations beginning with 0, schedno1 for DDC schedule notations beginning with 1, etc. The captions for the main classes are: 000: Computer science, information & general works; 100: Philosophy & psychology; 200: Religion; 300: Social sciences; 400: Language; 500: Science; 600: Technology; 700: Arts & recreation; 800: Literature; 900: History & geography. As illustrated by Fig. 2, DDC notations are not distributed uniformly: most schedule notations can be found in the class "Technology", followed by the class "Social sciences". The fewest notations belong to the class "Philosophy & psychology". With regard to the table notations (Fig. 3), the 7,816 Table 2 notations ("Geographic Areas, Historical Periods, Persons") stand out, whereas the quantities of all other table notations are comparatively small (Table 1: Standard Subdivisions; Table 3: Subdivisions for the Arts, for Individual Literatures, for Specific Literary Forms; Table 4: Subdivisions of Individual Languages and Language Families; Table 5: Ethnic and National Groups; Table 6: Languages).
As mentioned before, DDC notations that are not explicitly listed in the schedules can be built by using DDC instructions. This process is called "notational synthesis" or "number building". Its results are synthesized DDC notations (molecular DDC notations) that usually only DDC experts are able to interpret. But with the aid of our computer program "DDC analyzer", the meaning of molecular DDC notations is revealed, and the determined atomic DDC notations can be used, among other things, to answer question Q2.
3 Automatic analysis of DDC notations
The GBV Union Catalog GVK (Gemeinsamer VerbundKatalog, http://gso.gbv.de/) contains 3,073,423 intellectually DDC-classified title records (status: July 2004). After the automatic elimination of segmentation marks, obviously incorrect DDC notations (3.8 per cent of all DDC notations), and duplicate DDC notations, a total of 466,134 different DDC notations is available for the automatic analysis of DDC notations. This set of all GVK DDC notations serves as input data for the DDC analyzer. The frequency of DDC schedule notations is as follows (in descending order): those beginning with 3 (189,246), with 9 (62,115), with 7 (52,632), with 6 (51,704), with 5 (33,649), with 0 (23,946), with 2 (20,888), with 8 (20,678), with 4 (6,680), and with 1 (4,596). The arity of the GVK DDC notations is Gaussian distributed with a maximum at 10, i.e. most DDC notations have approx. arity 10; the shortest DDC notation has arity 1, the longest DDC notation has arity 29. Other important input data for the DDC analyzer were the 600 DDC numbers given in Liu’s dissertation. These 600 DDC numbers, which we call "Liu’s sample", were randomly selected by Liu from class 700 of the OCLC database.

Fig. 2. Frequency distribution of DDC schedule notations

Fig. 3. Frequency distribution of DDC table notations
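The preprocessing and the first-digit frequency count described above could look roughly like the following sketch. It is not the project's code: the set of segmentation marks ("/" and "'") and the validity pattern (three digits, optionally followed by the Dewey dot and further digits) are assumptions made for illustration, and the sample input is invented around the three notations quoted in the introduction.

```python
import re
from collections import Counter

# Assumption: segmentation marks are "/" and "'" characters.
SEGMENTATION_MARKS = "/'"
# Assumption: a plausible notation is three digits, optionally a dot and more digits.
VALID = re.compile(r"^\d{3}(\.\d+)?$")


def clean(raw_notations):
    """Strip segmentation marks, drop implausible notations, remove duplicates."""
    table = str.maketrans("", "", SEGMENTATION_MARKS)
    stripped = (raw.translate(table).strip() for raw in raw_notations)
    return sorted({s for s in stripped if VALID.match(s)})


def first_digit_frequencies(notations):
    """Count how many notations begin with each digit."""
    return Counter(n[0] for n in notations)


raw = ["025.1740973", "016.02092", "720.7073", "720.70/73", "72O.7", "025.1740973"]
notations = clean(raw)
frequencies = first_digit_frequencies(notations)
print(notations)
print(frequencies.most_common())
```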
As a member of the Consortium DDC German, we have access to the machine-readable data of the 22nd edition of the DDC system. These data are stored in an XML file. The English electronic web version is available as WebDewey (http://connexion.oclc.org/), the German counterpart as MelvilClass (http://services.ddc-deutsch.de/melvilclass-login). For our purpose, only the relevant data of the XML file, which contains the expert knowledge of the DDC system, are extracted and stored in a "knowledge base". Here, DDC notations, descriptors, and descriptor values are stored in consecutive fields, while facts and rules (as we call them) are represented in a very similar way:

"Electronic resources"

'#' serves as the field separator. The XML tags that are given in angle brackets stand for: "ba4" ("beginning of add table (all of table number)"), "na1" ("add note (part of schedule number)") and "hat" ("hierarchy at class"). "r1" and "r2", which follow "na1" or, respectively, "ba4", stand for the first two macro rules. The knowledge base contains 48,067 facts and 8,033 rules. The 8,033 rules can be generalized to macro rules. While Liu (1993) defined 17 (macro) rules for the decomposition for class 700, we defined 25 macro rules for all DDC classes.
Our program, the DDC analyzer, works as follows: after initializing variables, it reads the knowledge base and, triggered by one or more DDC notations to be analyzed, executes the analysis algorithm. The number of correct and incorrect DDC notations is counted. For a DDC notation, the analyzing process has two phases: determining the facts from left to right (phase 1) and determining the facts via rules from left to right (phase 2). After checking which output format has to be printed, the result is printed as a DDC analysis diagram or as a DDC analysis result set. After all DDC notations have been analyzed, the number of totally/partially analyzed DDC notations is printed. There are different reasons for a partially analyzed DDC notation: either the implementation of the DDC analyzer is incorrect/incomplete, or the DDC notation is incorrectly synthesized, or a part of the DDC system itself is incorrect.
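To make the control flow concrete, here is a heavily simplified sketch of phase 1 and of the final counting. It is a toy under stated assumptions and not the DDC analyzer itself: the fact table is hypothetical (two captions keyed by digit prefix), phase 2 (rules) is omitted entirely, and a remainder consisting only of filler zeros is treated as fully analyzed.

```python
# Hypothetical toy fact table: digit prefix -> caption.
FACTS = {
    "7": "Arts & recreation",
    "72": "Architecture",
}


def phase1(notation, facts):
    """Collect facts for successively longer prefixes, from left to right."""
    digits = notation.replace(".", "")  # ignore the Dewey dot
    found, matched = [], 0
    for end in range(1, len(digits) + 1):
        prefix = digits[:end]
        if prefix in facts:
            found.append((prefix, facts[prefix]))
            matched = end
    # Everything after the longest matching prefix is left for phase 2.
    return found, digits[matched:]


def analyze_all(notations, facts):
    """Count totally analyzed notations versus partially (or not) analyzed ones."""
    totally = partially = 0
    for notation in notations:
        found, rest = phase1(notation, facts)
        # Toy convention (assumption): a remainder of filler zeros counts as analyzed.
        if found and rest.strip("0") == "":
            totally += 1
        else:
            partially += 1
    return totally, partially


print(phase1("720.7073", FACTS))
print(analyze_all(["700", "720", "720.7073"], FACTS))
```

In this toy, "720.7073" stays partially analyzed because the digits after "72" would have to be resolved by rules, which is the job of phase 2 in the description above.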
The title of this book is:
#aVoices in architectural education: #bcultural politics and
The subject headings for this book are:
#aArchitecture #xStudy and teaching #zUnited States
#aArchitecture and state #zUnited States
.-7- North America <na4r7span:T1–0701-T1–0709:T2–7>
.-73 United States <na4r7span:T1–0701-T1–0709:T2–73>
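The angle-bracket annotations in the two diagram lines above are concatenations of short tags followed by ":"-delimited arguments. A small parser for that layout might look like the sketch below. The tag vocabulary is limited to the tags the paper names ("hat", "zen", "na1", "na4", "ba4", "span", and macro rules written as "r" plus a number); treating the head as a gap-free concatenation of such tags is an assumption, and `parse_annotation` is a hypothetical helper.

```python
import re

# Tags named in the paper; macro rules are "r" followed by a number.
TOKEN = re.compile(r"hat|zen|na1|na4|ba4|span|r\d+")


def parse_annotation(text):
    """Split '<tags:arg1:arg2>' into a list of tags and a list of arguments."""
    body = text.strip("<>")
    head, *args = body.split(":")  # ":" is the delimiter
    tokens = TOKEN.findall(head)
    if "".join(tokens) != head:
        raise ValueError(f"unknown tag sequence: {head!r}")
    return tokens, args


print(parse_annotation("<na4r7span:T1–0701-T1–0709:T2–73>"))
print(parse_annotation("<hatzen>"))
```

The first call separates the add note, the macro rule and the span marker from the Table 1 span and the Table 2 notation; the second shows a tag-only annotation.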
The information given in angle brackets should be read as follows: "hatzen" is the concatenation of "hat" ("hierarchy at class") and "zen" ("zen built entry (main tag)"). "T1–" stands for "table 1", "T2–" for "table 2", "na4" for "add note (add of table number)", "r7" for "macro rule 7", "span" for "span of numbers", and ":" for "delimiter". As you can see, while Liu decomposes the synthesized DDC notation into three chunks, our DDC analysis diagram shows the finest possible analysis of the molecular DDC notation. The fine analysis provides the advantage of uncovering additional captions: "Arts & recreation", "Architecture", "North America", and "Education, research, related topics".
A DDC analysis diagram contains analysis and synthesis information: (1) the molecular DDC notation to be analyzed; (2) an identifier (name) and the length of the molecular DDC notation; (3) the sequence and position of the digits within the molecular DDC notation; (4) the Dewey dot at position 4; (5) the relevant parts of the molecular DDC notation for each analysis step; (6) the corresponding caption for every atomic DDC notation; (7) the parts irrelevant for the respective analysis step, marked with "-"; (8) the type of the applied facts and rules, which appear in angle brackets. Once it has been explained how to read the information mentioned in (8), every synthesis step can be reproduced. While DDC analysis diagrams are intended for human experts, the DDC analysis result set can be used for data transfer. Cur-