Slide 1
Learning to Map Between Schemas and Ontologies
Alon Halevy, University of Washington
Joint work with AnHai Doan and Pedro Domingos
Slide 2
– A solution that uses multi-strategy learning.
– We started with schema matching (i.e., very simple ontologies).
– We are currently extending the approach to more expressive ontologies.
– Experiments show the approach is very promising!
Slide 3
The Structure Mapping Problem
Types of structures:
– database schemas, XML DTDs, ontologies, …
Input:
– two (or more) structures, S1 and S2
– data instances for S1 and S2
– background knowledge
Output:
– a mapping between S1 and S2
– should enable translating between data instances
– semantics of the mapping?
Slide 4
Semantic Mappings between Schemas
[Figure: two real-estate schemas, one with attributes such as location, contact, name, phone, full-baths, half-baths, and one with address, num-baths, contact-info, agent-name, agent-phone; arrows illustrate both a 1-1 mapping and a non-1-1 mapping between them.]
Slide 5
Database schema integration
– a problem as old as databases themselves
– database merging, data warehouses, data migration
– on the WWW, in enterprises, in large science projects
– model matching: a key operator in an algebra where models and mappings are first-class objects
– see [Bernstein et al., 2000] for more
The Semantic Web
– ontology mapping
System interoperability
– e-services, application integration, B2B applications, …
Slide 6
Desiderata for Proposed Solutions
Accuracy, efficiency, ease of use
Realistic expectations:
– unlikely to be fully automated; the user must stay in the loop
Some notion of semantics for mappings
Extensibility:
– the solution should exploit additional background knowledge
"Memory", knowledge reuse:
– the system should exploit previous manually or automatically generated matchings
– this is the key idea behind LSD
Slide 7
LSD Overview
L(earning) S(ource) D(escriptions)
– Setting: a mediated schema and a large set of data source schemas.
– Map a few sources by hand, and learn from them to generate the rest.
Slide 8
– Overview of structure mapping
– LSD architecture and details
– Experimental results
– Current work
Slide 9
Data Integration
Find houses with four bathrooms priced under $500,000
[Figure: a mediated schema over three source schemas — homes.com, realestate.com, homeseekers.com — accessed through wrappers.]
Applications: WWW, enterprises, science projects
Techniques: virtual data integration, warehousing, custom code
Query reformulation and optimization
Slide 10
Semantic Mappings between Schemas
[Figure (repeated from Slide 4): two real-estate schemas, one with attributes such as location, contact, name, phone, full-baths, half-baths, and one with address, num-baths, contact-info, agent-name, agent-phone; arrows illustrate both a 1-1 mapping and a non-1-1 mapping between them.]
Slide 11
– The two schemas are related through a universal relation W, and the mappings specify the projection variables and correspondences.
– W also includes the unmatched attributes of R and S.
Slide 12
Why Matching is Difficult
Aims to identify the same real-world entity
– using names, structures, types, data values, etc.
Schemas represent the same entity differently
– different names => same entity: area & address => location
– same names => different entities: area => location or square-feet
Schema & data never fully capture semantics!
– not adequately documented, not sufficiently expressive
Intended semantics is typically subjective!
– IBM Almaden Lab = IBM?
Cannot be fully automated; often hard even for humans. Committees are required!
Slide 13
Current State of Affairs
– largely done by hand
– labor-intensive & error-prone
– GTE: 4 hours/element for 27,000 elements [Li & Clifton, 2000]
Will only be exacerbated as
– data sharing & XML become pervasive
– DTDs proliferate
– legacy data must be translated
– ontologies on the Semantic Web must be reconciled
Slide 14
– Overview of structure mapping
– Data integration and source mappings
– Experimental results
– Current work
Slide 15
The LSD Approach
The user manually maps a few data sources to the mediated schema
LSD learns from those mappings, and proposes mappings for the rest of the sources
Several types of knowledge are used in learning:
– schema elements, e.g., attribute names
– data elements: ranges, formats, word frequencies, value frequencies, length of texts
– proximity of attributes
– functional dependencies, number of attribute occurrences
One learner does not fit all: use multiple learners and combine them with a meta-learner
Slide 16
Example
[Figure: the mediated schema (address, price, agent-phone, description) and sample data from two sources. One source: locations Miami, FL / Boston, MA; prices $250,000 / $110,000; phones (305) 729 0831 / (617) 253 1429; comments "Fantastic house", "Great location". homes.com: prices $550,000 / $320,000; contact-phone (278) 345 7215 / (617) 335 2315; "Beautiful yard", "Great beach".]
Learned hypotheses:
– If "phone" occurs in the attribute name => agent-phone
– If words like "fantastic" and "great" occur in the data values => description
Slide 17
Multi-Strategy Learning
Use a set of base learners:
– Name learner, Naive Bayes, Whirl, XML learner
And a set of recognizers:
– county names, zip codes, phone numbers
Each base learner produces a prediction weighted by a confidence score
Combine the base learners with a meta-learner, using stacking
Slide 18
Naive Bayes Learner [Domingos & Pazzani, 1997]
– "Kent, WA" => (address, 0.8), (name, 0.2)
Whirl Learner [Cohen & Hirsh, 1998]
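To make the flavor of such a base learner concrete, here is a minimal multinomial Naive Bayes classifier over bags of tokens, in the spirit of the "Kent, WA" example above. The class names, training pairs, and add-one smoothing are illustrative assumptions, not LSD's actual implementation.

```python
from collections import Counter, defaultdict
import math


class NaiveBayesLearner:
    """Minimal multinomial Naive Bayes over bags of tokens.

    Trains on (data value, mediated-schema element) pairs and returns
    labels ranked by normalized confidence for an unseen value.
    """

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter()             # label -> #examples
        self.vocab = set()

    def train(self, examples):
        for value, label in examples:
            tokens = value.lower().split()
            self.token_counts[label].update(tokens)
            self.label_counts[label] += 1
            self.vocab.update(tokens)

    def predict(self, value):
        tokens = value.lower().split()
        total = sum(self.label_counts.values())
        log_scores = {}
        for label in self.label_counts:
            # log prior + log likelihood with add-one smoothing
            logp = math.log(self.label_counts[label] / total)
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for t in tokens:
                logp += math.log((self.token_counts[label][t] + 1) / denom)
            log_scores[label] = logp
        # convert log scores to normalized confidences
        m = max(log_scores.values())
        exps = {lb: math.exp(s - m) for lb, s in log_scores.items()}
        z = sum(exps.values())
        return sorted(((lb, e / z) for lb, e in exps.items()),
                      key=lambda kv: -kv[1])


# Hypothetical training data and query value:
nb = NaiveBayesLearner()
nb.train([("Kent, WA", "address"), ("Seattle, WA", "address"),
          ("Gail Murphy", "name")])
ranked = nb.predict("Renton, WA")
```

Because "WA" appears only in address-labeled examples, the learner assigns most of the probability mass to address, mirroring the slide's (address, 0.8), (name, 0.2)-style output.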
Slide 19
Training the Base Learners
[Figure: training examples from the schema and data of realestate.com, labeled with mediated-schema elements (address, price, agent-phone, description), are fed to the Name Learner and the Naive Bayes Learner.]
Slide 21
Meta-Learner: Stacking
Training the meta-learner produces a weight for every pair of:
– (base learner, mediated-schema element), e.g.,
– weight(Name-Learner, address) = 0.1
– weight(Naive-Bayes, address) = 0.9
To combine predictions, the meta-learner:
– computes a weighted sum of the base learners' confidence scores
– e.g., (address, 0.6 * 0.1 + 0.8 * 0.9 = 0.78)
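The combination step above can be sketched in a few lines; the learner names and weight values below simply reproduce the slide's example and are not LSD's real identifiers.

```python
def combine_predictions(weights, predictions):
    """Stacking combination step: for each candidate mediated-schema
    element, sum the base learners' confidences weighted by the learned
    (learner, element) weights; return elements ranked by score."""
    scores = {}
    for learner, per_element in predictions.items():
        for element, conf in per_element.items():
            w = weights.get((learner, element), 0.0)
            scores[element] = scores.get(element, 0.0) + w * conf
    return sorted(scores.items(), key=lambda kv: -kv[1])


# The slide's numbers: two base learners both predicting "address".
weights = {("name-learner", "address"): 0.1,
           ("naive-bayes", "address"): 0.9}
predictions = {"name-learner": {"address": 0.6},
               "naive-bayes": {"address": 0.8}}
ranked = combine_predictions(weights, predictions)
# combined score for address: 0.6 * 0.1 + 0.8 * 0.9 = 0.78
```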
Slide 22
Training the Meta-Learner
Least-squares linear regression of the true predictions on the base learners' confidence scores yields, e.g.:
– weight(Name-Learner, address) = 0.1
– weight(Naive-Bayes, address) = 0.9
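A sketch of that regression for the two-learner case, solving the 2x2 normal equations directly; the training triples are made up so that the fit recovers the weights on the slide.

```python
def train_meta_weights(X, y):
    """Least-squares fit of per-learner weights for one mediated-schema
    element: minimize sum_i (w1*c1_i + w2*c2_i - y_i)^2 by solving the
    normal equations (X^T X) w = X^T y for two learners."""
    xtx11 = xtx12 = xtx22 = xty1 = xty2 = 0.0
    for (c1, c2), t in zip(X, y):
        xtx11 += c1 * c1
        xtx12 += c1 * c2
        xtx22 += c2 * c2
        xty1 += c1 * t
        xty2 += c2 * t
    det = xtx11 * xtx22 - xtx12 * xtx12
    w1 = (xtx22 * xty1 - xtx12 * xty2) / det
    w2 = (xtx11 * xty2 - xtx12 * xty1) / det
    return w1, w2


# Hypothetical training data: (name-learner conf, naive-bayes conf) per
# column, with y close to 1.0 when the column truly is "address".
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
y = [0.1, 0.9, 1.0]
w_name, w_nb = train_meta_weights(X, y)
```

With this data the exact solution is w_name = 0.1, w_nb = 0.9, matching the slide's weights.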
Trang 23<extra-info> Beautiful yard </>
Meta-Learner
( address ,0.8), ( description ,0.2) ( address ,0.6), ( description ,0.4) ( address ,0.7), ( description ,0.3)
( description ,0.8), ( address ,0.2)
Meta-Learner
Name Learner Naive Bayes
( address ,0.7), ( description ,0.3)
( agent-phone ,0.9), ( description ,0.1)
address price agent-phone description
Schema of homes.com Mediated schema
area day-phone extra-info
Slide 24
The Constraint Handler
Extends learning to incorporate constraints
– e.g., a = agent-phone & b = agent-name => a & b are usually close to each other
– user feedback = hard or soft constraints
Details in [Doan et al., SIGMOD 2001]
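One simple way to realize a hard-constraint handler is to search over combinations of column labels and keep the best combination that satisfies every constraint. The sketch below assumes a combination's score is the product of its element confidences; the column names, predictions, and the single constraint are invented for illustration.

```python
from itertools import product


def best_mapping(predictions, hard_constraints):
    """Pick the highest-scoring combination of labels, one per column,
    that satisfies every hard constraint.

    predictions: {column: [(label, confidence), ...]}
    hard_constraints: list of functions {column: label} -> bool
    Score of a combination = product of its confidences.
    """
    columns = list(predictions)
    best, best_score = None, -1.0
    for choice in product(*(predictions[c] for c in columns)):
        assignment = {c: lab for c, (lab, _) in zip(columns, choice)}
        if not all(ok(assignment) for ok in hard_constraints):
            continue
        score = 1.0
        for _, conf in choice:
            score *= conf
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score


# Hypothetical predictions: two columns both lean toward "address",
# but a hard constraint says at most one column may map to it.
preds = {"area": [("address", 0.7), ("description", 0.3)],
         "location": [("address", 0.9), ("description", 0.1)]}
at_most_one_address = lambda a: list(a.values()).count("address") <= 1
mapping, score = best_mapping(preds, [at_most_one_address])
```

The constraint forces the weaker "address" candidate to fall back to its second choice, illustrating how constraints can override individual learners' top predictions.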
Slide 25
The Current LSD System
[Figure: system architecture. A training phase learns from the mediated schema, the source schemas, and data listings; a matching phase proposes mappings, which the Constraint Handler refines using domain constraints and user feedback.]
Slide 26
– Overview of structure mapping
– Data integration and source mappings
– LSD architecture and details
– Current work
Slide 27
Empirical Evaluation
Domains:
– Real Estate I & II, Course Offerings, Faculty Listings
Methodology:
– create the mediated DTD & domain constraints
– choose five sources
– extract & convert data listings into XML (faithful to the schema!)
– mediated DTDs: 14-66 elements; source DTDs: 13-48
Ten runs per experiment; in each run:
– manually provide 1-1 mappings for 3 sources
– ask LSD to propose mappings for the remaining 2 sources
– accuracy = % of 1-1 mappings correctly identified
Slide 29
Sensitivity to Amount of Available Data
[Chart: matching accuracy as the amount of available data varies.]
Slide 30
Contribution of Schema vs. Data
[Chart: accuracy on Real Estate I, Real Estate II, Course Offerings, and Faculty Listings for LSD with only schema info vs. LSD with only data info.]
Slide 31
Reasons for Incorrect Matching
Unfamiliarity
– e.g., suburb
– solution: add a suburb-name recognizer
Insufficient information
– correctly identified the general type, but failed to pinpoint the exact type
– e.g., <agent-name> Richard Smith </> <phone> (206) 234 5412 </>
– solution: add a proximity learner
Subjectivity
– house-style = description?
Slide 32
– Overview of structure mapping
– Data integration and source mappings
– LSD architecture and details
– Experimental results
Slide 33
Moving Up the Expressiveness Ladder
Schemas are very simple ontologies
More expressive power = more domain constraints
– mappings become more complex, but constraints provide more to learn from
– F1(A1, …, Am) = F2(B1, …, Bm)
Ontologies (of various flavors):
– class hierarchy (i.e., containment on unary relations)
– relationships between objects
– constraints on relationships
Slide 34
Finding Non 1-1 Mappings (current work)
Given two schemas, find:
– 1-many mappings: address = concat(city, state)
– many-1 mappings: half-baths + full-baths = num-baths
– many-many mappings: concat(addr-line1, addr-line2) = concat(street, city, state)
Each mapping is expressed as a query:
– value correspondence expression: room-rate = rate * (1 + tax-rate)
– relationship: the state of tax-rate = the state of the hotel that has rate
Special case: 1-many mappings between two relational tables
[Figure: a source schema (city, state, comments, half-baths, full-baths) mapped to a mediated schema (address, description, num-baths).]
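The correspondences in the figure, viewed as a query over the source, can be sketched as a per-row transformation. The attribute names follow the figure; the separator in the concat and the comments-to-description pairing are illustrative assumptions.

```python
def to_mediated(row):
    """Map one source row (city, state, comments, half-baths,
    full-baths) to the mediated schema (address, description,
    num-baths) using the slide's non-1-1 correspondences."""
    return {
        # 1-many: one mediated attribute from two source attributes
        "address": f'{row["city"]}, {row["state"]}',
        # many-1: two source attributes summed into one
        "num-baths": row["half-baths"] + row["full-baths"],
        # 1-1: direct correspondence (assumed pairing)
        "description": row["comments"],
    }


example = to_mediated({"city": "Miami", "state": "FL",
                       "half-baths": 1, "full-baths": 2,
                       "comments": "Great location"})
```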
Slide 35
For each set of mediated-schema columns:
– enumerate all possible mappings
– evaluate & return the best mapping
[Figure: source-schema columns matched against mediated-schema columns; similarity is computed using all base learners.]
Slide 36
Search-Based Solution
States = columns
– goal state: a mediated-schema column
– initial states: all source-schema columns
– use 1-1 matching to reduce the set of initial states
Operators: concat, +, -, *, /, etc.
Column similarity:
– use all base learners + recognizers
Slide 37
Multi-Strategy Search
Use a set of expert modules, L1, L2, …, Ln; each module
– applies only to certain types of mediated-schema columns
– searches a small subspace
– uses a cheap similarity measure to compare columns
Examples:
– L1: text; concat; TF/IDF
– L2: numeric; +, -, *, /; [Ho et al., 2000]
– L3: address; concat; Naive Bayes
Search techniques:
– beam search as the default
– specialized searches that do not have to materialize columns
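A toy version of such a search, assuming concat is the only operator and using token-level Jaccard overlap as a stand-in for the cheap similarity measure; the column names and data values are invented.

```python
def beam_search_mapping(source_cols, target_vals, beam_width=3, max_cols=3):
    """Beam search for a concatenation of source columns that best
    matches a target (mediated-schema) column.

    source_cols: {column name: list of values}
    target_vals: list of values of the target column
    Returns (similarity, tuple of source column names).
    """
    def tokens(vals):
        return {t for v in vals for t in v.lower().split()}

    target_tokens = tokens(target_vals)

    def concat(cols):
        # materialize the candidate column by row-wise concatenation
        return [" ".join(parts)
                for parts in zip(*(source_cols[c] for c in cols))]

    def sim(cols):
        cand = tokens(concat(cols))
        union = cand | target_tokens
        return len(cand & target_tokens) / len(union) if union else 0.0

    # start from single columns, keep only the top-k states per level
    beam = sorted(((sim((c,)), (c,)) for c in source_cols),
                  reverse=True)[:beam_width]
    best = beam[0]
    for _ in range(max_cols - 1):
        expanded = [(sim(cols + (c,)), cols + (c,))
                    for _, cols in beam
                    for c in source_cols if c not in cols]
        if not expanded:
            break
        beam = sorted(expanded, reverse=True)[:beam_width]
        if beam[0][0] > best[0]:
            best = beam[0]
    return best


source = {"city": ["Miami", "Boston"],
          "state": ["FL", "MA"],
          "price": ["$250,000", "$110,000"]}
target = ["Miami FL", "Boston MA"]  # a mediated "address"-like column
score, cols = beam_search_mapping(source, target)
```

On this data the search discovers that concatenating city and state reproduces exactly the target's tokens, so the best state scores a similarity of 1.0.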
Slide 38
Multi-Strategy Search (cont'd)
Combine modules’ predictions & select the best one
Slide 39
Related Work
– TRANSCM [Milo & Zohar, 1998], ARTEMIS [Castano & Antonellis, 1999], [Palopoli et al., 1998], CUPID [Madhavan et al., 2001]
– SEMINT [Li & Clifton, 1994], ILA [Perkowitz & Etzioni, 1995], DELTA [Clifton et al., 1997]
– CLIO [Miller et al., 2000], [Yan et al., 2001]
– LSD [Doan et al., 2000, 2001]
Dimensions along which these systems differ: Single Learner + 1-1 Matching; Hybrid + 1-1 Matching; Recognizers + Schema + 1-1 Matching; Sophisticated Data-Driven User Interaction; Schema + Data, 1-1 + non-1-1 Matching
Slide 40
LSD:
– applies multi-strategy learning to generate semantic mappings
– is extensible, exploiting background knowledge and previous techniques
– experimental results are promising
Slide 41
Backup Slides
Slide 42
[Figure: mappings m1, m3 between schemas.]
Ten months later:
– are the mappings still correct?
Slide 43
Information Extraction from Text
Extract data fragments from text documents
– e.g., the date, location, & victim's name from a news article
Intensive research on free-text documents
Many documents do have substantial structure
– XML pages, name cards, tables, lists
– the structure forms a schema
– only one data value per schema element, whereas a "real" data source has many data values per schema element
Ongoing research in the IE community
Slide 44
Contribution of Each Component
[Chart: accuracy contribution of each LSD component.]
Slide 45
Exploiting Hierarchical Structure
Existing learners flatten out all structures
The XML learner is similar to the Naive Bayes learner
– input instance = bag of tokens
– but differs in one crucial aspect: it considers not only text tokens, but also structure tokens
Example instance:
< description > Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors. </ description >
< contact >
< name > Gail Murphy </ name >
< firm > MAX Realtors </ firm >
</ contact >
Slide 46
Domain Constraints
Impose semantic regularities on sources
– verified using schema or data
– a = address & b = address => a = b
– a = house-id => a is a key
– a = agent-info & b = agent-name => b is nested in a
Can be specified up front
– when creating the mediated schema
– independent of any actual source schema
Trang 47area : ( address ,0.7), ( description ,0.3)
The Constraint Handler
Can specify arbitrary constraints
User feedback = domain constraint
Extended to handle domain heuristics
– a = agent-phone & b = agent-name a & b are usually close to each other
0.3 0.1 0.4 0.012
0.7 0.9 0.6 0.378
0.7 0.9 0.4 0.252
Domain Constraints
a = address & b = adderss a = b
Predictions from Meta-Learner