Table 2 Formation of a third class in the Euclidean consensus partitions for the Gordon-Vichi macroeconomic ensemble as a function of the weight ratio w between 3- and 2-class partitions in the ensemble.
w      Members of the third class
1.5    India
2.0    India, Sudan
3.0    India, Sudan
4.5    India, Sudan, Bolivia, Indonesia
10.0   India, Sudan, Bolivia, Indonesia
12.5   India, Sudan, Bolivia, Indonesia, Egypt
∞      India, Sudan, Bolivia, Indonesia, Egypt
In one of these experiments, 85 female undergraduates at Rutgers University were asked to sort 15 English kinship terms into classes "on the basis of some aspect of meaning". There are at least three "axes" for classification: gender, generation, and direct versus indirect lineage. The Euclidean consensus partitions with Q = 3 classes put grandparents and grandchildren in one class and all indirect kins into another one. For Q = 4, {brother, sister} are separated from {father, mother, daughter, son}. Table 3 shows the memberships for a soft Euclidean consensus partition for Q = 5 based on 1000 replications of the AO algorithm.
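The AO heuristic alternates between two steps: optimally matching the classes of every base partition to the current consensus (a linear assignment problem) and averaging the matched membership matrices. A minimal sketch in Python/NumPy of this generic scheme (our rendering, not the clue implementation; memberships is a list of n × Q membership matrices, one per ensemble member, and weights holds their w_b):

import numpy as np
from scipy.optimize import linear_sum_assignment

def soft_euclidean_consensus(memberships, weights, n_iter=100, seed=0):
    # Alternating optimization: match classes, then average memberships.
    rng = np.random.default_rng(seed)
    n, q = memberships[0].shape
    M = rng.dirichlet(np.ones(q), size=n)        # random soft start
    w = np.asarray(weights, dtype=float)
    for _ in range(n_iter):
        matched = []
        for U in memberships:
            agreement = U.T @ M                  # class-to-class agreement
            r, c = linear_sum_assignment(-agreement)
            P = np.zeros((q, q))
            P[r, c] = 1.0                        # optimal permutation
            matched.append(U @ P)
        M_new = np.tensordot(w, np.stack(matched), axes=1) / w.sum()
        if np.allclose(M_new, M):
            break
        M = M_new
    return M

Since the criterion is non-convex in general, the 1000 replications correspond to restarting such a procedure from random initializations and keeping the best solution.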
Table 3 Memberships for the 5-class soft Euclidean consensus partition for the Rosenberg-Kim kinship terms data.
Term           Class 1  Class 2  Class 3  Class 4  Class 5
grandfather     0.000    0.024    0.012    0.965    0.000
grandmother     0.005    0.134    0.016    0.840    0.005
granddaughter   0.113    0.242    0.054    0.466    0.125
grandson        0.134    0.111    0.052    0.581    0.122
brother         0.612    0.282    0.024    0.082    0.000
sister          0.579    0.391    0.026    0.002    0.002
father          0.099    0.546    0.122    0.158    0.075
mother          0.089    0.654    0.136    0.054    0.066
daughter        0.000    1.000    0.000    0.000    0.000
son             0.031    0.842    0.007    0.113    0.007
nephew          0.012    0.047    0.424    0.071    0.447
niece           0.000    0.129    0.435    0.000    0.435
cousin          0.080    0.056    0.656    0.033    0.174
aunt            0.000    0.071    0.929    0.000    0.000
uncle           0.000    0.000    0.882    0.071    0.047
Figure 1 indicates the classes and margins for the 5-class solution. We see that the memberships of 'niece' are tied between columns 3 and 5, and that the margin of 'nephew' is only very small (0.02), suggesting the 4-class solution as the optimal Euclidean consensus representation of the ensemble.
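The classes and margins can be read directly off Table 3; a small sketch in Python (memberships abridged to four of the fifteen terms) reproduces the tie for 'niece' and the small margin for 'nephew':

memberships = {                      # values from Table 3 (abridged)
    "grandfather": [0.000, 0.024, 0.012, 0.965, 0.000],
    "daughter":    [0.000, 1.000, 0.000, 0.000, 0.000],
    "nephew":      [0.012, 0.047, 0.424, 0.071, 0.447],
    "niece":       [0.000, 0.129, 0.435, 0.000, 0.435],
}

for term, m in memberships.items():
    ranked = sorted(range(len(m)), key=m.__getitem__, reverse=True)
    margin = m[ranked[0]] - m[ranked[1]]
    # 'niece' has margin 0.000 (a tie between classes 3 and 5) and
    # 'nephew' only about 0.02, the weak spots of the 5-class solution.
    print(f"{term:12s} class {ranked[0] + 1}  margin {margin:.3f}")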
Fig. 1 Classes (indicated by plot symbol and class id) and margins (differences between the largest and second largest membership values) for the 5-class soft Euclidean consensus partition for the Rosenberg-Kim kinship terms data.
Quite interestingly, none of these consensus partitions splits according to gender, even though there are such partitions in the data. To take the natural heterogeneity in the data into account, one could try to partition them (perform clusterwise aggregation, Gaul and Schader (1988)), resulting in meta-partitions (Gordon and Vichi (1998)) of the underlying objects. Function cl_pclust in package clue provides an AO heuristic for soft prototype-based partitioning of classifications, allowing in particular to obtain soft or hard meta-partitions with soft or hard Euclidean consensus partitions as prototypes.

References
BARTHÉLEMY, J. P. and MONJARDET, B. (1981): The median procedure in cluster analysis and social choice theory. Mathematical Social Sciences, 1, 235–267.
BARTHÉLEMY, J. P. and MONJARDET, B. (1988): The median procedure in data analysis: new results and open problems. In: H. H. Bock, editor, Classification and Related Methods of Data Analysis. North-Holland, Amsterdam, 309–316.
BOORMAN, S. A. and ARABIE, P. (1972): Structural measures and the method of sorting. In: R. N. Shepard, A. K. Romney and S. B. Nerlove, editors, Multidimensional Scaling: Theory and Applications in the Behavioral Sciences, 1: Theory. Seminar Press, New York, 225–249.
CHARON, I., DENOEUD, L., GUENOCHE, A. and HUDRY, O. (2006): Maximum transfer distance between partitions. Journal of Classification, 23(1), 103–121.
DAY, W. H. E. (1981): The complexity of computing metric distances between partitions. Mathematical Social Sciences, 1, 269–287.
DIMITRIADOU, E., WEINGESSEL, A. and HORNIK, K. (2002): A combination scheme for fuzzy clustering. International Journal of Pattern Recognition and Artificial Intelligence, 16(7), 901–912.
GAUL, W. and SCHADER, M. (1988): Clusterwise aggregation of relations. Applied Stochastic Models and Data Analysis, 4, 273–282.
GORDON, A. D. and VICHI, M. (1998): Partitions of partitions. Journal of Classification, 15, 265–285.
GORDON, A. D. and VICHI, M. (2001): Fuzzy partition models for fitting a set of partitions. Psychometrika, 66(2), 229–248.
GUSFIELD, D. (2002): Partition-distance: A problem and class of perfect graphs arising in clustering. Information Processing Letters, 82, 159–164.
HORNIK, K. (2005a): A CLUE for CLUster Ensembles. Journal of Statistical Software, 14(12).
HORNIK, K. (2007a): clue: Cluster Ensembles. R package version 0.3-12.
HORNIK, K. (2007b): On maximal Euclidean partition dissimilarity. Under preparation.
HORNIK, K. and BÖHM, W. (2007): Alternating optimization algorithms for Euclidean and Manhattan consensus partitions. Under preparation.
MIRKIN, B. G. (1974): The problem of approximation in space of relations and qualitative data analysis. Automatika y Telemechanika; translated in: Information and Remote Control, 35, 1424–1438.
PAPADIMITRIOU, C. and STEIGLITZ, K. (1982): Combinatorial Optimization: Algorithms and Complexity. Prentice Hall, Englewood Cliffs.
ROSENBERG, S. (1982): The method of sorting in multivariate research with applications selected from cognitive psychology and person perception. In: N. Hirschberg and L. G. Humphreys, editors, Multivariate Applications in the Social Sciences. Erlbaum, Hillsdale, New Jersey, 117–142.
ROSENBERG, S. and KIM, M. P. (1975): The method of sorting as a data-gathering procedure in multivariate research. Multivariate Behavioral Research, 10, 489–502.
RUBIN, J. (1967): Optimal classification into groups: An approach for solving the taxonomy problem. Journal of Theoretical Biology, 15, 103–144.
WAKABAYASHI, Y. (1998): The complexity of computing median relations. Resenhas do Instituto de Matemática e Estatística, Universidade de São Paulo, 3/3, 323–349.
ZHOU, D., LI, J. and ZHA, H. (2005): A new Mallows distance based metric for comparing clusterings. In: ICML '05: Proceedings of the 22nd International Conference on Machine Learning. ACM Press, New York, NY, USA, 1028–1035.
Labeled Data
Steffen Rendle and Lars Schmidt-Thieme
Information Systems and Machine Learning Lab, University of Hildesheim
{srendle, schmidt-thieme}@ismll.uni-hildesheim.de
Abstract. A central task when integrating data from different sources is to detect identical items. For example, price comparison websites have to identify offers for identical products. This task is known, among others, as record linkage, object identification, or duplicate detection.
In this work, we examine problem settings where some relations between items are given in advance – for example by EAN article codes in an e-commerce scenario or by manually labeled parts. To represent and solve these problems we bring in ideas of semi-supervised and constrained clustering in terms of pairwise must-link and cannot-link constraints. We show that extending object identification by pairwise constraints results in an expressive framework that subsumes many variants of the integration problem like traditional object identification, matching, iterative problems or an active learning setting.

For solving these integration tasks, we propose an extension to current object identification models that assures consistent solutions to problems with constraints. Our evaluation shows that additionally taking the labeled data into account dramatically increases the quality of state-of-the-art object identification systems.
1 Introduction
When information collected from many sources should be integrated, different objects may refer to the same underlying entity. Object identification aims at identifying such equivalent objects. A typical scenario is a price comparison system where offers from different shops are collected and identical products have to be found. Decisions about identities are based on noisy attributes like product names or brands. Moreover, often some parts of the data provide some kind of label that can additionally be used. For example, some offers might be labeled by a European Article Number (EAN) or an International Standard Book Number (ISBN). In this work we investigate problem settings where such information is provided on some parts of the data. We will present three different kinds of knowledge that restrict the set of consistent solutions. For solving these constrained object identification problems we extend the generic object identification model by a collective decision model that is guided by both constraints and similarities.
2 Related work
Object identification (e.g. Neiling 2005) is also known as record linkage (e.g. Winkler 1999) and duplicate detection (e.g. Bilenko and Mooney 2003). State-of-the-art methods use an adaptive approach and learn a similarity measure that is used for predicting the equivalence relation (e.g. Cohen and Richman 2002). In contrast, our approach also takes labels in terms of constraints into account.

Using pairwise constraints for guiding decisions is studied in the community of semi-supervised or constrained clustering – e.g. Basu et al. (2004). However, the problem setting in object identification differs from this scenario because in semi-supervised clustering typically a small number of classes is considered and often it is assumed that the number of classes is known in advance. Moreover, semi-supervised clustering does not use expensive pairwise models that are common in object identification.
3 Four problem classes
In the classical object identification problem C_classic a set of objects X should be grouped into equivalence classes E_X. In an adaptive setting, a second set Y of objects is available where the perfect equivalence relation E_Y is known. It is assumed that X and Y are disjoint and share no classes – i.e. E_X ∩ E_Y = ∅.
In real-world problems often there is no such clear separation between labeled and unlabeled data. Instead only the objects of some subset Y of X are labeled. We call this problem setting the iterative problem C_iter where (X, Y, E_Y) is given with X ⊇ Y and Y² ⊇ E_Y. Obviously, consistent solutions E_X have to satisfy E_X ∩ Y² = E_Y. Examples of applications for iterative problems are the integration of offers from different sources where some offers are labeled by a unique identifier like an EAN or ISBN, and iterative integration tasks where an already integrated set of objects is extended by new objects.
The third problem setting deals with integrating data from n sources, where each source is assumed to contain no duplicates at all. This is called the class of matching problems C_match. Here the problem is given by X = {X_1, ..., X_n} with X_i ∩ X_j = ∅ for i ≠ j, and the set of consistent equivalence relations E is restricted to relations E on X with E ∩ X_i² = {(x, x) | x ∈ X_i}. Traditional record linkage often deals with matching problems of two data sets (n = 2).
At last, there is the class of pairwise constrained problems C_constr. Here each problem is defined by (X, R_ml, R_cl) where the set of objects X is constrained by a must-link relation R_ml and a cannot-link relation R_cl. Consistent solutions are restricted to equivalence relations E with E ∩ R_cl = ∅ and E ⊇ R_ml. Obviously, R_cl is symmetric and irreflexive whereas R_ml has to be an equivalence relation. In all, pairwise constrained problems differ from iterative problems by labeling relations instead of labeling objects. The constrained problem class can better describe local information like two offers being the same or different. Such information can for example be provided by a human expert in an active learning setting.
Fig. 1 Relations between problem classes: C_classic ⊂ C_iter ⊂ C_constr and C_classic ⊂ C_match ⊂ C_constr.
We will show that the presented problem classes form a hierarchy C_classic ⊂ C_iter ⊂ C_constr and that C_classic ⊂ C_match ⊂ C_constr, but neither C_match ⊆ C_iter nor C_iter ⊆ C_match (see Figure 1). First of all, it is easy to see that C_classic ⊆ C_iter because any problem X ∈ C_classic corresponds to an iterative problem without labeled data (Y = ∅). Also C_classic ⊆ C_match because an arbitrary problem X ∈ C_classic can be transformed to a matching problem by considering each object as its own dataset: X_1 = {x_1}, ..., X_n = {x_n}. On the other hand, C_iter ⊄ C_classic and C_match ⊄ C_classic, because C_classic is not able to formulate any restriction on the set of possible solutions E as the other classes can do. This shows that:
C_classic ⊂ C_match,   C_classic ⊂ C_iter    (1)

Next we will show that C_iter ⊂ C_constr. First of all, any iterative problem (X, Y, E_Y) can be transformed to a constrained problem (X, R_ml, R_cl) by setting R_ml ← {(y_1, y_2) | y_1 ≡_{E_Y} y_2} and R_cl ← {(y_1, y_2) | y_1 ≢_{E_Y} y_2}. On the other hand, there are problems (X, R_ml, R_cl) ∈ C_constr that cannot be expressed as an iterative problem, e.g.:

X = {x_1, x_2, x_3, x_4},   R_ml = {(x_1, x_2), (x_3, x_4)},   R_cl = ∅
If one tries to express this as an iterative problem, one would assign to the pair (x_1, x_2) the label l_1 and to (x_3, x_4) the label l_2. But one has to decide whether or not l_1 = l_2. If l_1 = l_2, then the corresponding constrained problem would include the constraint (x_2, x_3) ∈ R_ml, which differs from the original problem. Otherwise, if l_1 ≠ l_2, this would imply (x_2, x_3) ∈ R_cl, which again is a different problem. Therefore:
C_iter ⊂ C_constr    (2)

Furthermore, C_match ⊆ C_constr because any matching problem X_1, ..., X_n can be expressed as a constrained problem with:

R_ml ← {(x, x) | x ∈ X},   R_cl ← ⋃_{i=1..n} {(x, y) ∈ X_i × X_i | x ≠ y}    (3)
Moreover, neither matching problems subsume iterative ones nor vice versa:

C_match ⊄ C_iter,   C_iter ⊄ C_match    (4)
In all we have shown that C_constr is the most expressive class and subsumes all the other classes.
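The reductions used in these arguments are directly implementable; a sketch in plain Python (our code, following the constructions above):

def iterative_to_constrained(labels):
    # labels: dict mapping each labeled object y to its class label.
    # Same label -> must-link, different labels -> cannot-link.
    R_ml, R_cl = set(), set()
    items = list(labels.items())
    for i, (x, lx) in enumerate(items):
        for y, ly in items[i + 1:]:
            (R_ml if lx == ly else R_cl).update({(x, y), (y, x)})
    return R_ml, R_cl

def matching_to_constrained(sources):
    # Duplicate-free sources: cannot-links between all distinct objects
    # of the same source, only trivial must-links (cf. Eq. (3)).
    R_cl = set()
    for X_i in sources:
        R_cl.update((x, y) for x in X_i for y in X_i if x != y)
    R_ml = {(x, x) for X_i in sources for x in X_i}
    return R_ml, R_cl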
4 Method
Object identification is generally done by three core components (Rendle and Schmidt-Thieme (2006)):

1. Pairwise feature extraction with a function f : X² → R^n.
2. Probabilistic pairwise decision model specifying probabilities for equivalences P[x ≡ y].
3. Collective decision model generating an equivalence relation E over X.
The task of feature extraction is to generate a feature vector from the attribute descriptions of any two objects. Mostly, heuristic similarity functions like TFIDF cosine similarity or Levenshtein distance are used. The probabilistic pairwise decision model combines several of these heuristic functions to a single domain-specific similarity function (see Table 1). For this model probabilistic classifiers like SVMs, decision trees, logistic regression, etc. can be used. By combining many heuristic functions over several attributes, no time-consuming function selection and fine-tuning has to be performed by a domain expert. Instead, the model automatically learns which similarity function is important for a specific problem. Cohen and Richman (2002) as well as Bilenko and Mooney (2003) have shown that this approach is successful. The collective decision model generates an equivalence relation over X by using sim(x, y) := P[x ≡ y] as learned similarity measure. Often, clustering is used for this task (e.g. Cohen and Richman (2002)).
Table 1 Example of feature extraction and prediction of pairwise equivalence P[x_i ≡ x_j] for three digital cameras.

x_1   Hewlett Packard Photosmart 435 Digital Camera   118.99
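For illustration, the first two components might be sketched as follows in Python with scikit-learn (the two heuristic features, the toy records x2 and x3, and the use of logistic regression in place of an SVM are our choices for the example, not the paper's):

from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def features(a, b):
    # Pairwise feature vector f : X^2 -> R^n from heuristic similarities.
    name_sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
    price_diff = abs(a["price"] - b["price"]) / max(a["price"], b["price"])
    return [name_sim, price_diff]

x1 = {"name": "Hewlett Packard Photosmart 435 Digital Camera", "price": 118.99}
x2 = {"name": "HP Photosmart 435 Digital Camera", "price": 110.00}   # assumed
x3 = {"name": "Canon EOS 300D Digital Camera", "price": 786.00}      # assumed

pairs, labels = [(x1, x2), (x1, x3), (x2, x3)], [1, 0, 0]
clf = LogisticRegression().fit([features(a, b) for a, b in pairs], labels)

def sim(a, b):
    # Learned similarity sim(x, y) := P[x equivalent to y].
    return clf.predict_proba([features(a, b)])[0][1]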
4.1 Collective decision model with constraints
The constrained problem easily fits into the generic model above by extending the collective decision model by constraints. As this stage might be solved by clustering algorithms in the classical problem, we propose to solve the constrained problem by a constraint-based clustering algorithm. To enforce the constraint satisfaction we suggest a constrained hierarchical agglomerative clustering (HAC) algorithm. Instead of a dendrogram the algorithm builds a partition where each cluster should contain equivalent objects. Because in an object identification task the number of equivalence classes is almost never known, we suggest model selection by a (learned) threshold T on the similarity of two clusters in order to stop the merging process. A simplified representation of our constrained HAC algorithm is shown in Algorithm 1. The algorithm initially creates a new cluster for each object (line 2) and afterwards merges clusters that contain objects constrained by a must-link (lines 3–7). Then the most similar clusters that are not constrained by a cannot-link are merged until the threshold T is reached.
From a theoretical point of view this task might be solved by an arbitrary probabilistic HAC algorithm using a special initialization of the similarity matrix and minor changes in the update step of the matrix. For satisfaction of the constraints R_ml and R_cl, one initializes the similarity matrix A for X = {x_1, ..., x_n} in the following way:

A_{i,j} = +∞ if (x_i, x_j) ∈ R_ml,   A_{i,j} = −∞ if (x_i, x_j) ∈ R_cl,   A_{i,j} = sim(x_i, x_j) otherwise.

As usual, in each iteration the two clusters with the highest similarity are merged. After merging cluster c_l with c_m, the dimension of the square matrix A reduces by one – both in columns and rows. For ensuring constraint satisfaction, the similarities between c_l ∪ c_m and all the other clusters have to be recomputed such that a −∞ entry of c_l or c_m is preserved in the merged row and column.
For calculating the similarity sim between clusters, standard linkage techniques like single-, complete- or average-linkage can be used:

sim_sl(c_1 ∪ c_2, c_3) = max{sim_sl(c_1, c_3), sim_sl(c_2, c_3)}   (single linkage)
sim_cl(c_1 ∪ c_2, c_3) = min{sim_cl(c_1, c_3), sim_cl(c_2, c_3)}   (complete linkage)
sim_al(c_1 ∪ c_2, c_3) = (|c_1| · sim_al(c_1, c_3) + |c_2| · sim_al(c_2, c_3)) / (|c_1| + |c_2|)   (average linkage)

Algorithm 1 Constrained HAC Algorithm

1: procedure CLUSTERHAC(X, R_ml, R_cl)
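A sketch of the described procedure in plain Python (our reconstruction from the text, not the authors' listing; the comments map to the algorithm lines referenced above, single linkage is used, and R_cl is assumed symmetric):

import math

def cluster_hac(X, R_ml, R_cl, sim, T):
    clusters = [{x} for x in X]                  # line 2: one cluster each

    def find(x):
        return next(c for c in clusters if x in c)

    for x, y in R_ml:                            # lines 3-7: must-link merges
        cx, cy = find(x), find(y)
        if cx is not cy:
            clusters.remove(cx)
            clusters.remove(cy)
            clusters.append(cx | cy)

    def linkage(c1, c2):                         # single linkage; cannot-links
        if any((x, y) in R_cl for x in c1 for y in c2):
            return -math.inf                     # act as an impassable barrier
        return max(sim(x, y) for x in c1 for y in c2)

    while len(clusters) > 1:                     # lines 8-13: merge while the
        best, bi, bj = -math.inf, -1, -1         # best similarity reaches T
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = linkage(clusters[i], clusters[j])
                if s > best:
                    best, bi, bj = s, i, j
        if best < T:
            break
        merged = clusters[bi] | clusters[bj]
        clusters = [c for k, c in enumerate(clusters) if k not in (bi, bj)]
        clusters.append(merged)
    return clusters

With complete linkage the −∞ barrier would propagate through the min update automatically; the explicit check above is what single and average linkage need.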
Table 2 Comparison of F-measure quality of a constrained to a classical method with different linkage techniques. For each data set and each method the best linkage technique is marked bold.

Data set     Method               Single linkage   Complete linkage   Average linkage
Cora         classic/constrained  0.70/0.92        0.74/0.71          0.89/0.93
DVD player   classic/constrained  0.87/0.94        0.79/0.73          0.86/0.95
Camera       classic/constrained  0.65/0.86        0.60/0.45          0.67/0.81

Second, a blocker should reduce the number of pairs that have to be taken into account for merging. Blockers like the canopy blocker (McCallum et al. (2000)) reduce the number of pairs very efficiently, so even large data sets can be handled. At last, pruning should be applied to eliminate cluster pairs with similarity below T_prune. These optimizations can be implemented by storing a cluster-distance list of pairs which is initialized with the pruned candidate pairs of the blocker.
5 Evaluation
In our evaluation study we examine if additionally guiding the collective decision model by constraints improves the quality. Therefore we compare constrained and unconstrained versions of the same object identification model on different data sets. As data sets we use the bibliographic Cora dataset that is provided by McCallum et al. (2000) and is widely used for evaluating object identification models (e.g. Cohen and Richman (2002) and Bilenko and Mooney (2003)), and two product data sets of a price comparison system.
We set up an iterative problem by labeling N% of the objects with their true class label. For feature extraction of the Cora model we use TFIDF cosine similarity, Levenshtein distance and Jaccard distance for every attribute. The model for the product datasets uses TFIDF cosine similarity, the difference between prices and some domain-specific comparison functions. The pairwise decision model is chosen to be a Support Vector Machine. In the collective decision model we run our constrained HAC algorithm against an unconstrained ('classic') one. In each case, we run three different linkage methods: single-, complete- and average-linkage. We report the average F-measure quality of four runs for each of the linkage techniques and for constrained and unconstrained clustering. The F-measure quality is taken on all pairs that are unknown in advance – i.e. pairs that do not link two labeled objects:

F-Measure = (2 · Recall · Precision) / (Recall + Precision)
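A minimal helper implementing this evaluation protocol (our code; the predicted and true equivalence relations are given as sets of object pairs, labeled is the set of labeled objects):

def pairwise_f_measure(E_pred, E_true, labeled):
    # Score only pairs that do not link two labeled objects.
    def unknown(pairs):
        return {(x, y) for (x, y) in pairs
                if not (x in labeled and y in labeled)}
    pred, true = unknown(E_pred), unknown(E_true)
    tp = len(pred & true)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(true)
    return 2 * precision * recall / (precision + recall)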
Table 2 shows the results of the first experiment, in which labels are provided for part of the objects for Cora and for N = 50% of the objects for the product datasets. As one can see, the best constrained method always clearly outperforms the best classical method. When switching from the best classical to the best constrained method, the relative error reduces by 36% for Cora, 62% for DVD player and 58% for Camera. An informal significance test shows that in this experiment the best constrained method is better than the best classic one.

Fig. 2 F-Measure on the Camera dataset for varying proportions of labeled objects.
In a second experiment (see Figure 2) we increased the amount of labeled data from N = 10% to N = 60% and report results for the Camera dataset for the best classical method and the three constrained linkage techniques. The figure shows that the best classical method does not improve much beyond 20% labeled data. In contrast, when using the constrained single- or average-linkage technique, the quality on non-labeled parts always improves with more labeled data. When few constraints are available, average-linkage tends to be better than single-linkage, whereas single-linkage is superior in the case of many constraints. The reason is the cannot-links, which prevent single-linkage from merging false pairs. The bad performance of constrained complete-linkage can be explained by must-link constraints that might result in diverse clusters (Algorithm 1, lines 3–7). For any diverse cluster, complete-linkage cannot find any cluster with similarity greater than T, and so after the initial step diverse clusters are not merged any more (Algorithm 1, lines 8–13).
6 Conclusion

In all, constrained HAC with single- or average-linkage is effective, and using constraints in the collective stage clearly outperforms non-constrained state-of-the-art methods.
References

BASU, S., BILENKO, M. and MOONEY, R. J. (2004): A Probabilistic Framework for Semi-Supervised Clustering. In: Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining (KDD-2004).
BILENKO, M. and MOONEY, R. J. (2003): Adaptive Duplicate Detection Using Learnable String Similarity Measures. In: Proceedings of the 9th International Conference on Knowledge Discovery and Data Mining (KDD-2003).
COHEN, W. W. and RICHMAN, J. (2002): Learning to Match and Cluster Large High-Dimensional Data Sets for Data Integration. In: Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining (KDD-2002).
MCCALLUM, A. K., NIGAM, K. and UNGAR, L. (2000): Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. In: Proceedings of the 6th International Conference on Knowledge Discovery and Data Mining (KDD-2000).
NEILING, M. (2005): Identification of Real-World Objects in Multiple Databases. In: Proceedings of GfKl Conference 2005.
RENDLE, S. and SCHMIDT-THIEME, L. (2006): Object Identification with Constraints. In: Proceedings of the 6th IEEE International Conference on Data Mining (ICDM-2006).
WINKLER, W. E. (1999): The State of Record Linkage and Current Research Problems. Technical report, Statistical Research Division, U.S. Census Bureau.