Table 2 Formation of a third class in the Euclidean consensus partitions for the Gordon-Vichi macroeconomic ensemble as a function of the weight ratio w between 3- and 2-class partitions in the ensemble.
w      Members of the third class
1.5    India
2.0    India, Sudan
3.0    India, Sudan
4.5    India, Sudan, Bolivia, Indonesia
10.0   India, Sudan, Bolivia, Indonesia
12.5   India, Sudan, Bolivia, Indonesia, Egypt
∞      India, Sudan, Bolivia, Indonesia, Egypt
In one of these experiments, 85 female undergraduates at Rutgers University were asked to sort 15 English kinship terms into classes "on the basis of some aspect of meaning". There are at least three "axes" for classification: gender, generation, and direct versus indirect lineage. The Euclidean consensus partitions with Q = 3 classes put grandparents and grandchildren in one class and all indirect kins into another one. For Q = 4, {brother, sister} are separated from {father, mother, daughter, son}. Table 3 shows the memberships for a soft Euclidean consensus partition for Q = 5 based on 1000 replications of the AO algorithm.
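The AO heuristic alternates between two steps: optimally matching the classes of every base partition to the current consensus (a linear assignment problem) and averaging the matched membership matrices. A minimal sketch in Python/NumPy of this generic scheme (our rendering, not the clue implementation; memberships is a list of n × Q membership matrices, one per ensemble member, and weights holds their w_b):

import numpy as np
from scipy.optimize import linear_sum_assignment

def soft_euclidean_consensus(memberships, weights, n_iter=100, seed=0):
    # Alternating optimization: match classes, then average memberships.
    rng = np.random.default_rng(seed)
    n, q = memberships[0].shape
    M = rng.dirichlet(np.ones(q), size=n)        # random soft start
    w = np.asarray(weights, dtype=float)
    for _ in range(n_iter):
        matched = []
        for U in memberships:
            agreement = U.T @ M                  # class-to-class agreement
            r, c = linear_sum_assignment(-agreement)
            P = np.zeros((q, q))
            P[r, c] = 1.0                        # optimal permutation
            matched.append(U @ P)
        M_new = np.tensordot(w, np.stack(matched), axes=1) / w.sum()
        if np.allclose(M_new, M):
            break
        M = M_new
    return M

Since the criterion is non-convex in general, the 1000 replications correspond to restarting such a procedure from random initializations and keeping the best solution.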
Table 3 Memberships for the 5-class soft Euclidean consensus partition for the Rosenberg-Kim kinship terms data.
Term           Class 1  Class 2  Class 3  Class 4  Class 5
grandfather     0.000    0.024    0.012    0.965    0.000
grandmother     0.005    0.134    0.016    0.840    0.005
granddaughter   0.113    0.242    0.054    0.466    0.125
grandson        0.134    0.111    0.052    0.581    0.122
brother         0.612    0.282    0.024    0.082    0.000
sister          0.579    0.391    0.026    0.002    0.002
father          0.099    0.546    0.122    0.158    0.075
mother          0.089    0.654    0.136    0.054    0.066
daughter        0.000    1.000    0.000    0.000    0.000
son             0.031    0.842    0.007    0.113    0.007
nephew          0.012    0.047    0.424    0.071    0.447
niece           0.000    0.129    0.435    0.000    0.435
cousin          0.080    0.056    0.656    0.033    0.174
aunt            0.000    0.071    0.929    0.000    0.000
uncle           0.000    0.000    0.882    0.071    0.047
Figure 1 indicates the classes and margins for the 5-class solution. We see that the memberships of 'niece' are tied between columns 3 and 5, and that the margin of 'nephew' is only very small (0.02), suggesting the 4-class solution as the optimal Euclidean consensus representation of the ensemble.
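The classes and margins can be read directly off Table 3; a small sketch in Python (memberships abridged to four of the fifteen terms) reproduces the tie for 'niece' and the small margin for 'nephew':

memberships = {                      # values from Table 3 (abridged)
    "grandfather": [0.000, 0.024, 0.012, 0.965, 0.000],
    "daughter":    [0.000, 1.000, 0.000, 0.000, 0.000],
    "nephew":      [0.012, 0.047, 0.424, 0.071, 0.447],
    "niece":       [0.000, 0.129, 0.435, 0.000, 0.435],
}

for term, m in memberships.items():
    ranked = sorted(range(len(m)), key=m.__getitem__, reverse=True)
    margin = m[ranked[0]] - m[ranked[1]]
    # 'niece' has margin 0.000 (a tie between classes 3 and 5) and
    # 'nephew' only about 0.02, the weak spots of the 5-class solution.
    print(f"{term:12s} class {ranked[0] + 1}  margin {margin:.3f}")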
Fig. 1 Classes (indicated by plot symbol and class id) and margins (differences between the largest and second largest membership values) for the 5-class soft Euclidean consensus partition for the Rosenberg-Kim kinship terms data.
Quite interestingly, none of these consensus partitions splits according to gender, even though there are such partitions in the data. To take the natural heterogeneity in the data into account, one could try to partition them (perform clusterwise aggregation, Gaul and Schader (1988)), resulting in meta-partitions (Gordon and Vichi (1998)) of the underlying objects. Function cl_pclust in package clue provides an AO heuristic for soft prototype-based partitioning of classifications, allowing in particular to obtain soft or hard meta-partitions with soft or hard Euclidean consensus partitions as prototypes.

References
BARTHÉLEMY, J. P. and MONJARDET, B. (1981): The median procedure in cluster analysis and social choice theory. Mathematical Social Sciences, 1, 235–267.
BARTHÉLEMY, J. P. and MONJARDET, B. (1988): The median procedure in data analysis: new results and open problems. In: H. H. Bock, editor, Classification and Related Methods of Data Analysis. North-Holland, Amsterdam, 309–316.
BOORMAN, S. A. and ARABIE, P. (1972): Structural measures and the method of sorting. In: R. N. Shepard, A. K. Romney and S. B. Nerlove, editors, Multidimensional Scaling: Theory and Applications in the Behavioral Sciences, 1: Theory. Seminar Press, New York, 225–249.
CHARON, I., DENOEUD, L., GUENOCHE, A. and HUDRY, O. (2006): Maximum transfer distance between partitions. Journal of Classification, 23(1), 103–121.
DAY, W. H. E. (1981): The complexity of computing metric distances between partitions. Mathematical Social Sciences, 1, 269–287.
DIMITRIADOU, E., WEINGESSEL, A. and HORNIK, K. (2002): A combination scheme for fuzzy clustering. International Journal of Pattern Recognition and Artificial Intelligence, 16(7), 901–912.
GAUL, W. and SCHADER, M. (1988): Clusterwise aggregation of relations. Applied Stochastic Models and Data Analysis, 4, 273–282.
GORDON, A. D. and VICHI, M. (1998): Partitions of partitions. Journal of Classification, 15, 265–285.
GORDON, A. D. and VICHI, M. (2001): Fuzzy partition models for fitting a set of partitions. Psychometrika, 66(2), 229–248.
GUSFIELD, D. (2002): Partition-distance: A problem and class of perfect graphs arising in clustering. Information Processing Letters, 82, 159–164.
HORNIK, K. (2005a): A CLUE for CLUster Ensembles. Journal of Statistical Software, 14(12).
HORNIK, K. (2007a): clue: Cluster Ensembles. R package version 0.3-12.
HORNIK, K. (2007b): On maximal Euclidean partition dissimilarity. Under preparation.
HORNIK, K. and BÖHM, W. (2007): Alternating optimization algorithms for Euclidean and Manhattan consensus partitions. Under preparation.
MIRKIN, B. G. (1974): The problem of approximation in space of relations and qualitative data analysis. Automatika y Telemechanika; translated in: Information and Remote Control, 35, 1424–1438.
PAPADIMITRIOU, C. and STEIGLITZ, K. (1982): Combinatorial Optimization: Algorithms and Complexity. Prentice Hall, Englewood Cliffs.
ROSENBERG, S. (1982): The method of sorting in multivariate research with applications selected from cognitive psychology and person perception. In: N. Hirschberg and L. G. Humphreys, editors, Multivariate Applications in the Social Sciences. Erlbaum, Hillsdale, New Jersey, 117–142.
ROSENBERG, S. and KIM, M. P. (1975): The method of sorting as a data-gathering procedure in multivariate research. Multivariate Behavioral Research, 10, 489–502.
RUBIN, J. (1967): Optimal classification into groups: An approach for solving the taxonomy problem. Journal of Theoretical Biology, 15, 103–144.
WAKABAYASHI, Y. (1998): The complexity of computing median relations. Resenhas do Instituto de Matemática e Estatística, Universidade de São Paulo, 3/3, 323–349.
ZHOU, D., LI, J. and ZHA, H. (2005): A new Mallows distance based metric for comparing clusterings. In: ICML '05: Proceedings of the 22nd International Conference on Machine Learning. ACM Press, New York, NY, USA, 1028–1035.
Labeled Data
Steffen Rendle and Lars Schmidt-Thieme
Information Systems and Machine Learning Lab, University of Hildesheim
{srendle, schmidt-thieme}@ismll.uni-hildesheim.de
Abstract. A central task when integrating data from different sources is to detect identical items. For example, price comparison websites have to identify offers for identical products. This task is known, among others, as record linkage, object identification, or duplicate detection.
In this work, we examine problem settings where some relations between items are given in advance – for example by EAN article codes in an e-commerce scenario or by manually labeled parts. To represent and solve these problems we bring in ideas of semi-supervised and constrained clustering in terms of pairwise must-link and cannot-link constraints. We show that extending object identification by pairwise constraints results in an expressive framework that subsumes many variants of the integration problem like traditional object identification, matching, iterative problems or an active learning setting.

For solving these integration tasks, we propose an extension to current object identification models that assures consistent solutions to problems with constraints. Our evaluation shows that additionally taking the labeled data into account dramatically increases the quality of state-of-the-art object identification systems.
1 Introduction
When information collected from many sources should be integrated, different objects may refer to the same underlying entity. Object identification aims at identifying such equivalent objects. A typical scenario is a price comparison system where offers from different shops are collected and identical products have to be found. Decisions about identities are based on noisy attributes like product names or brands. Moreover, often some parts of the data provide some kind of label that can additionally be used. For example, some offers might be labeled by a European Article Number (EAN) or an International Standard Book Number (ISBN). In this work we investigate problem settings where such information is provided on some parts of the data. We will present three different kinds of knowledge that restrict the set of consistent solutions. For solving these constrained object identification problems we extend the generic object identification model by a collective decision model that is guided by both constraints and similarities.
2 Related work
Object identification (e.g. Neiling 2005) is also known as record linkage (e.g. Winkler 1999) and duplicate detection (e.g. Bilenko and Mooney 2003). State-of-the-art methods use an adaptive approach and learn a similarity measure that is used for predicting the equivalence relation (e.g. Cohen and Richman 2002). In contrast, our approach also takes labels in terms of constraints into account.

Using pairwise constraints for guiding decisions is studied in the community of semi-supervised or constrained clustering – e.g. Basu et al. (2004). However, the problem setting in object identification differs from this scenario because in semi-supervised clustering typically a small number of classes is considered and often it is assumed that the number of classes is known in advance. Moreover, semi-supervised clustering does not use expensive pairwise models that are common in object identification.
3 Four problem classes
In the classical object identification problem C_classic a set of objects X should be grouped into equivalence classes E_X. In an adaptive setting, a second set Y of objects is available where the perfect equivalence relation E_Y is known. It is assumed that X and Y are disjoint and share no classes – i.e. E_X ∩ E_Y = ∅.
In real-world problems often there is no such clear separation between labeled and unlabeled data. Instead only the objects of some subset Y of X are labeled. We call this problem setting the iterative problem C_iter where (X, Y, E_Y) is given with X ⊇ Y and Y² ⊇ E_Y. Obviously, consistent solutions E_X have to satisfy E_X ∩ Y² = E_Y. Examples of applications for iterative problems are the integration of offers from different sources where some offers are labeled by a unique identifier like an EAN or ISBN, and iterative integration tasks where an already integrated set of objects is extended by new objects.
The third problem setting deals with integrating data from n sources, where each source is assumed to contain no duplicates at all. This is called the class of matching problems C_match. Here the problem is given by X = {X_1, ..., X_n} with X_i ∩ X_j = ∅ for i ≠ j, and the set of consistent equivalence relations E is restricted to relations E on X with E ∩ X_i² = {(x, x) | x ∈ X_i}. Traditional record linkage often deals with matching problems of two data sets (n = 2).
At last, there is the class of pairwise constrained problems C_constr. Here each problem is defined by (X, R_ml, R_cl) where the set of objects X is constrained by a must-link relation R_ml and a cannot-link relation R_cl. Consistent solutions are restricted to equivalence relations E with E ∩ R_cl = ∅ and E ⊇ R_ml. Obviously, R_cl is symmetric and irreflexive whereas R_ml has to be an equivalence relation. In all, pairwise constrained problems differ from iterative problems by labeling relations instead of labeling objects. The constrained problem class can better describe local information like two offers being the same or different. Such information can for example be provided by a human expert in an active learning setting.
Fig. 1 Relations between problem classes: C_classic ⊂ C_iter ⊂ C_constr and C_classic ⊂ C_match ⊂ C_constr.
We will show that the presented problem classes form a hierarchy C_classic ⊂ C_iter ⊂ C_constr and that C_classic ⊂ C_match ⊂ C_constr, but neither C_match ⊆ C_iter nor C_iter ⊆ C_match (see Figure 1). First of all, it is easy to see that C_classic ⊆ C_iter because any problem X ∈ C_classic corresponds to an iterative problem without labeled data (Y = ∅). Also C_classic ⊆ C_match because an arbitrary problem X ∈ C_classic can be transformed to a matching problem by considering each object as its own dataset: X_1 = {x_1}, ..., X_n = {x_n}. On the other hand, C_iter ⊄ C_classic and C_match ⊄ C_classic, because C_classic is not able to formulate any restriction on the set of possible solutions E as the other classes can do. This shows that:
C_classic ⊂ C_match,   C_classic ⊂ C_iter    (1)

Next we will show that C_iter ⊂ C_constr. First of all, any iterative problem (X, Y, E_Y) can be transformed to a constrained problem (X, R_ml, R_cl) by setting R_ml ← {(y_1, y_2) | y_1 ≡_{E_Y} y_2} and R_cl ← {(y_1, y_2) | y_1 ≢_{E_Y} y_2}. On the other hand, there are problems (X, R_ml, R_cl) ∈ C_constr that cannot be expressed as an iterative problem, e.g.:

X = {x_1, x_2, x_3, x_4},   R_ml = {(x_1, x_2), (x_3, x_4)},   R_cl = ∅
If one tries to express this as an iterative problem, one would assign to the pair (x_1, x_2) the label l_1 and to (x_3, x_4) the label l_2. But one has to decide whether or not l_1 = l_2. If l_1 = l_2, then the corresponding constrained problem would include the constraint (x_2, x_3) ∈ R_ml, which differs from the original problem. Otherwise, if l_1 ≠ l_2, this would imply (x_2, x_3) ∈ R_cl, which again is a different problem. Therefore:
C_iter ⊂ C_constr    (2)

Furthermore, C_match ⊆ C_constr because any matching problem X_1, ..., X_n can be expressed as a constrained problem with:

R_ml ← {(x, x) | x ∈ X},   R_cl ← ⋃_{i=1..n} {(x, y) ∈ X_i × X_i | x ≠ y}    (3)
Moreover, neither matching problems subsume iterative ones nor vice versa:

C_match ⊄ C_iter,   C_iter ⊄ C_match    (4)
In all we have shown that C_constr is the most expressive class and subsumes all the other classes.
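The reductions used in these arguments are directly implementable; a sketch in plain Python (our code, following the constructions above):

def iterative_to_constrained(labels):
    # labels: dict mapping each labeled object y to its class label.
    # Same label -> must-link, different labels -> cannot-link.
    R_ml, R_cl = set(), set()
    items = list(labels.items())
    for i, (x, lx) in enumerate(items):
        for y, ly in items[i + 1:]:
            (R_ml if lx == ly else R_cl).update({(x, y), (y, x)})
    return R_ml, R_cl

def matching_to_constrained(sources):
    # Duplicate-free sources: cannot-links between all distinct objects
    # of the same source, only trivial must-links (cf. Eq. (3)).
    R_cl = set()
    for X_i in sources:
        R_cl.update((x, y) for x in X_i for y in X_i if x != y)
    R_ml = {(x, x) for X_i in sources for x in X_i}
    return R_ml, R_cl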
4 Method
Object identification is generally done by three core components (Rendle and Schmidt-Thieme (2006)):

1. Pairwise feature extraction with a function f : X² → R^n.
2. Probabilistic pairwise decision model specifying probabilities for equivalences P[x ≡ y].
3. Collective decision model generating an equivalence relation E over X.
The task of feature extraction is to generate a feature vector from the attribute descriptions of any two objects. Mostly, heuristic similarity functions like TFIDF cosine similarity or Levenshtein distance are used. The probabilistic pairwise decision model combines several of these heuristic functions to a single domain-specific similarity function (see Table 1). For this model probabilistic classifiers like SVMs, decision trees, logistic regression, etc. can be used. By combining many heuristic functions over several attributes, no time-consuming function selection and fine-tuning has to be performed by a domain expert. Instead, the model automatically learns which similarity function is important for a specific problem. Cohen and Richman (2002) as well as Bilenko and Mooney (2003) have shown that this approach is successful. The collective decision model generates an equivalence relation over X by using sim(x, y) := P[x ≡ y] as learned similarity measure. Often, clustering is used for this task (e.g. Cohen and Richman (2002)).
Table 1 Example of feature extraction and prediction of pairwise equivalence P[x_i ≡ x_j] for three digital cameras.

x_1   Hewlett Packard Photosmart 435 Digital Camera   118.99
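For illustration, the first two components might be sketched as follows in Python with scikit-learn (the two heuristic features, the toy records x2 and x3, and the use of logistic regression in place of an SVM are our choices for the example, not the paper's):

from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def features(a, b):
    # Pairwise feature vector f : X^2 -> R^n from heuristic similarities.
    name_sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
    price_diff = abs(a["price"] - b["price"]) / max(a["price"], b["price"])
    return [name_sim, price_diff]

x1 = {"name": "Hewlett Packard Photosmart 435 Digital Camera", "price": 118.99}
x2 = {"name": "HP Photosmart 435 Digital Camera", "price": 110.00}   # assumed
x3 = {"name": "Canon EOS 300D Digital Camera", "price": 786.00}      # assumed

pairs, labels = [(x1, x2), (x1, x3), (x2, x3)], [1, 0, 0]
clf = LogisticRegression().fit([features(a, b) for a, b in pairs], labels)

def sim(a, b):
    # Learned similarity sim(x, y) := P[x equivalent to y].
    return clf.predict_proba([features(a, b)])[0][1]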
4.1 Collective decision model with constraints
The constrained problem easily fits into the generic model above by extending the collective decision model by constraints. As this stage might be solved by clustering algorithms in the classical problem, we propose to solve the constrained problem by a constraint-based clustering algorithm. To enforce the constraint satisfaction we suggest a constrained hierarchical agglomerative clustering (HAC) algorithm. Instead of a dendrogram the algorithm builds a partition where each cluster should contain equivalent objects. Because in an object identification task the number of equivalence classes is almost never known, we suggest model selection by a (learned) threshold T on the similarity of two clusters in order to stop the merging process. A simplified representation of our constrained HAC algorithm is shown in Algorithm 1. The algorithm initially creates a new cluster for each object (line 2) and afterwards merges clusters that contain objects constrained by a must-link (lines 3–7). Then the most similar clusters that are not constrained by a cannot-link are merged until the threshold T is reached.
From a theoretical point of view this task might be solved by an arbitrary probabilistic HAC algorithm using a special initialization of the similarity matrix and minor changes in the update step of the matrix. For satisfaction of the constraints R_ml and R_cl, one initializes the similarity matrix A for X = {x_1, ..., x_n} in the following way:

A_{i,j} = +∞ if (x_i, x_j) ∈ R_ml,   A_{i,j} = −∞ if (x_i, x_j) ∈ R_cl,   A_{i,j} = sim(x_i, x_j) otherwise.

As usual, in each iteration the two clusters with the highest similarity are merged. After merging cluster c_l with c_m, the dimension of the square matrix A reduces by one – both in columns and rows. For ensuring constraint satisfaction, the similarities between c_l ∪ c_m and all the other clusters have to be recomputed such that a −∞ entry of c_l or c_m is preserved in the merged row and column.
For calculating the similarity sim between clusters, standard linkage techniques like single-, complete- or average-linkage can be used:

sim_sl(c_1 ∪ c_2, c_3) = max{sim_sl(c_1, c_3), sim_sl(c_2, c_3)}   (single linkage)
sim_cl(c_1 ∪ c_2, c_3) = min{sim_cl(c_1, c_3), sim_cl(c_2, c_3)}   (complete linkage)
sim_al(c_1 ∪ c_2, c_3) = (|c_1| · sim_al(c_1, c_3) + |c_2| · sim_al(c_2, c_3)) / (|c_1| + |c_2|)   (average linkage)

Algorithm 1 Constrained HAC Algorithm

1: procedure CLUSTERHAC(X, R_ml, R_cl)
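A sketch of the described procedure in plain Python (our reconstruction from the text, not the authors' listing; the comments map to the algorithm lines referenced above, single linkage is used, and R_cl is assumed symmetric):

import math

def cluster_hac(X, R_ml, R_cl, sim, T):
    clusters = [{x} for x in X]                  # line 2: one cluster each

    def find(x):
        return next(c for c in clusters if x in c)

    for x, y in R_ml:                            # lines 3-7: must-link merges
        cx, cy = find(x), find(y)
        if cx is not cy:
            clusters.remove(cx)
            clusters.remove(cy)
            clusters.append(cx | cy)

    def linkage(c1, c2):                         # single linkage; cannot-links
        if any((x, y) in R_cl for x in c1 for y in c2):
            return -math.inf                     # act as an impassable barrier
        return max(sim(x, y) for x in c1 for y in c2)

    while len(clusters) > 1:                     # lines 8-13: merge while the
        best, bi, bj = -math.inf, -1, -1         # best similarity reaches T
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = linkage(clusters[i], clusters[j])
                if s > best:
                    best, bi, bj = s, i, j
        if best < T:
            break
        merged = clusters[bi] | clusters[bj]
        clusters = [c for k, c in enumerate(clusters) if k not in (bi, bj)]
        clusters.append(merged)
    return clusters

With complete linkage the −∞ barrier would propagate through the min update automatically; the explicit check above is what single and average linkage need.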
Table 2 Comparison of F-measure quality of a constrained to a classical method with different linkage techniques. For each data set and each method the best linkage technique is marked bold.

Data set     Method               Single linkage   Complete linkage   Average linkage
Cora         classic/constrained  0.70/0.92        0.74/0.71          0.89/0.93
DVD player   classic/constrained  0.87/0.94        0.79/0.73          0.86/0.95
Camera       classic/constrained  0.65/0.86        0.60/0.45          0.67/0.81

Second, a blocker should reduce the number of pairs that have to be taken into account for merging. Blockers like the canopy blocker (McCallum et al. (2000)) reduce the number of pairs very efficiently, so even large data sets can be handled. At last, pruning should be applied to eliminate cluster pairs with similarity below T_prune. These optimizations can be implemented by storing a cluster-distance list of pairs which is initialized with the pruned candidate pairs of the blocker.
5 Evaluation
In our evaluation study we examine if additionally guiding the collective decision model by constraints improves the quality. Therefore we compare constrained and unconstrained versions of the same object identification model on different data sets. As data sets we use the bibliographic Cora dataset that is provided by McCallum et al. (2000) and is widely used for evaluating object identification models (e.g. Cohen and Richman (2002) and Bilenko and Mooney (2003)), and two product data sets of a price comparison system.
We set up an iterative problem by labeling N% of the objects with their true class label. For feature extraction of the Cora model we use TFIDF cosine similarity, Levenshtein distance and Jaccard distance for every attribute. The model for the product datasets uses TFIDF cosine similarity, the difference between prices and some domain-specific comparison functions. The pairwise decision model is chosen to be a Support Vector Machine. In the collective decision model we run our constrained HAC algorithm against an unconstrained ('classic') one. In each case, we run three different linkage methods: single-, complete- and average-linkage. We report the average F-measure quality of four runs for each of the linkage techniques and for constrained and unconstrained clustering. The F-measure quality is taken on all pairs that are unknown in advance – i.e. pairs that do not link two labeled objects:

F-Measure = (2 · Recall · Precision) / (Recall + Precision)
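A minimal helper implementing this evaluation protocol (our code; the predicted and true equivalence relations are given as sets of object pairs, labeled is the set of labeled objects):

def pairwise_f_measure(E_pred, E_true, labeled):
    # Score only pairs that do not link two labeled objects.
    def unknown(pairs):
        return {(x, y) for (x, y) in pairs
                if not (x in labeled and y in labeled)}
    pred, true = unknown(E_pred), unknown(E_true)
    tp = len(pred & true)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(true)
    return 2 * precision * recall / (precision + recall)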
Table 2 shows the results of the first experiment, in which labels are provided for part of the objects for Cora and for N = 50% of the objects for the product datasets. As one can see, the best constrained method always clearly outperforms the best classical method. When switching from the best classical to the best constrained method, the relative error reduces by 36% for Cora, 62% for DVD player and 58% for Camera. An informal significance test shows that in this experiment the best constrained method is better than the best classic one.

Fig. 2 F-Measure on the Camera dataset for varying proportions of labeled objects.
In a second experiment (see Figure 2) we increased the amount of labeled data from N = 10% to N = 60% and report results for the Camera dataset for the best classical method and the three constrained linkage techniques. The figure shows that the best classical method does not improve much beyond 20% labeled data. In contrast, when using the constrained single- or average-linkage technique, the quality on non-labeled parts always improves with more labeled data. When few constraints are available, average-linkage tends to be better than single-linkage, whereas single-linkage is superior in the case of many constraints. The reason is the cannot-links, which prevent single-linkage from merging false pairs. The bad performance of constrained complete-linkage can be explained by must-link constraints that might result in diverse clusters (Algorithm 1, lines 3–7). For any diverse cluster, complete-linkage cannot find any cluster with similarity greater than T, and so after the initial step diverse clusters are not merged any more (Algorithm 1, lines 8–13).
6 Conclusion

In all, constrained HAC with single- or average-linkage is effective, and using constraints in the collective stage clearly outperforms non-constrained state-of-the-art methods.
References

BASU, S., BILENKO, M. and MOONEY, R. J. (2004): A Probabilistic Framework for Semi-Supervised Clustering. In: Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining (KDD-2004).
BILENKO, M. and MOONEY, R. J. (2003): Adaptive Duplicate Detection Using Learnable String Similarity Measures. In: Proceedings of the 9th International Conference on Knowledge Discovery and Data Mining (KDD-2003).
COHEN, W. W. and RICHMAN, J. (2002): Learning to Match and Cluster Large High-Dimensional Data Sets for Data Integration. In: Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining (KDD-2002).
MCCALLUM, A. K., NIGAM, K. and UNGAR, L. (2000): Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. In: Proceedings of the 6th International Conference on Knowledge Discovery and Data Mining (KDD-2000).
NEILING, M. (2005): Identification of Real-World Objects in Multiple Databases. In: Proceedings of GfKl Conference 2005.
RENDLE, S. and SCHMIDT-THIEME, L. (2006): Object Identification with Constraints. In: Proceedings of the 6th IEEE International Conference on Data Mining (ICDM-2006).
WINKLER, W. E. (1999): The State of Record Linkage and Current Research Problems. Technical report, Statistical Research Division, U.S. Census Bureau.