Slide 1
Learning to Map Between Schemas and Ontologies
Alon Halevy, University of Washington
Joint work with AnHai Doan and Pedro Domingos
Slide 2
– A solution that uses multi-strategy learning.
– We started with schema matching (i.e., very simple ontologies).
– We are currently extending the approach to more expressive ontologies.
– Experiments show the approach is very promising!
Slide 3
The Structure Mapping Problem
Types of structures:
– database schemas, XML DTDs, ontologies, …
Input:
– two (or more) structures, S1 and S2
– data instances for S1 and S2
– background knowledge
Output:
– a mapping between S1 and S2
– should enable translating between data instances
– semantics of the mapping?
Slide 4
Semantic Mappings between Schemas
[Figure: two real-estate schemas, one with attributes such as location, contact, name, phone, full-baths, half-baths, and one with address, num-baths, contact-info, agent-name, agent-phone; arrows illustrate both a 1-1 mapping and a non-1-1 mapping between them.]
Slide 5
Database schema integration
– a problem as old as databases themselves
– database merging, data warehouses, data migration
– on the WWW, in enterprises, in large science projects
– model matching: a key operator in an algebra where models and mappings are first-class objects
– see [Bernstein et al., 2000] for more
The Semantic Web
– ontology mapping
System interoperability
– e-services, application integration, B2B applications, …
Slide 6
Desiderata for Proposed Solutions
Accuracy, efficiency, ease of use
Realistic expectations:
– unlikely to be fully automated; the user must stay in the loop
Some notion of semantics for mappings
Extensibility:
– the solution should exploit additional background knowledge
"Memory", knowledge reuse:
– the system should exploit previous manually or automatically generated matchings
– this is the key idea behind LSD
Slide 7
LSD Overview
L(earning) S(ource) D(escriptions)
– Setting: a mediated schema and a large set of data source schemas.
– Map a few sources by hand, and learn from them to generate the rest.
Slide 8
– Overview of structure mapping
– LSD architecture and details
– Experimental results
– Current work
Slide 9
Data Integration
Find houses with four bathrooms priced under $500,000
[Figure: a mediated schema over three source schemas — homes.com, realestate.com, homeseekers.com — accessed through wrappers.]
Applications: WWW, enterprises, science projects
Techniques: virtual data integration, warehousing, custom code
Query reformulation and optimization
Slide 10
Semantic Mappings between Schemas
[Figure (repeated from Slide 4): two real-estate schemas, one with attributes such as location, contact, name, phone, full-baths, half-baths, and one with address, num-baths, contact-info, agent-name, agent-phone; arrows illustrate both a 1-1 mapping and a non-1-1 mapping between them.]
Slide 11
– The two schemas are related through a universal relation W, and the mappings specify the projection variables and correspondences.
– W also includes the unmatched attributes of R and S.
Slide 12
Why Matching is Difficult
Aims to identify the same real-world entity
– using names, structures, types, data values, etc.
Schemas represent the same entity differently
– different names => same entity: area & address => location
– same names => different entities: area => location or square-feet
Schema & data never fully capture semantics!
– not adequately documented, not sufficiently expressive
Intended semantics is typically subjective!
– IBM Almaden Lab = IBM?
Cannot be fully automated; often hard even for humans. Committees are required!
Slide 13
Current State of Affairs
– largely done by hand
– labor-intensive & error-prone
– GTE: 4 hours/element for 27,000 elements [Li & Clifton, 2000]
Will only be exacerbated as
– data sharing & XML become pervasive
– DTDs proliferate
– legacy data must be translated
– ontologies on the Semantic Web must be reconciled
Slide 14
– Overview of structure mapping
– Data integration and source mappings
– Experimental results
– Current work
Slide 15
The LSD Approach
The user manually maps a few data sources to the mediated schema
LSD learns from those mappings, and proposes mappings for the rest of the sources
Several types of knowledge are used in learning:
– schema elements, e.g., attribute names
– data elements: ranges, formats, word frequencies, value frequencies, length of texts
– proximity of attributes
– functional dependencies, number of attribute occurrences
One learner does not fit all: use multiple learners and combine them with a meta-learner
Slide 16
Example
[Figure: the mediated schema (address, price, agent-phone, description) and sample data from two sources. One source: locations Miami, FL / Boston, MA; prices $250,000 / $110,000; phones (305) 729 0831 / (617) 253 1429; comments "Fantastic house", "Great location". homes.com: prices $550,000 / $320,000; contact-phone (278) 345 7215 / (617) 335 2315; "Beautiful yard", "Great beach".]
Learned hypotheses:
– If "phone" occurs in the attribute name => agent-phone
– If words like "fantastic" and "great" occur in the data values => description
Slide 17
Multi-Strategy Learning
Use a set of base learners:
– Name learner, Naive Bayes, Whirl, XML learner
And a set of recognizers:
– county names, zip codes, phone numbers
Each base learner produces a prediction weighted by a confidence score
Combine the base learners with a meta-learner, using stacking
Slide 18
Naive Bayes Learner [Domingos & Pazzani, 1997]
– "Kent, WA" => (address, 0.8), (name, 0.2)
Whirl Learner [Cohen & Hirsh, 1998]
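To make the flavor of such a base learner concrete, here is a minimal multinomial Naive Bayes classifier over bags of tokens, in the spirit of the "Kent, WA" example above. The class names, training pairs, and add-one smoothing are illustrative assumptions, not LSD's actual implementation.

```python
from collections import Counter, defaultdict
import math


class NaiveBayesLearner:
    """Minimal multinomial Naive Bayes over bags of tokens.

    Trains on (data value, mediated-schema element) pairs and returns
    labels ranked by normalized confidence for an unseen value.
    """

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter()             # label -> #examples
        self.vocab = set()

    def train(self, examples):
        for value, label in examples:
            tokens = value.lower().split()
            self.token_counts[label].update(tokens)
            self.label_counts[label] += 1
            self.vocab.update(tokens)

    def predict(self, value):
        tokens = value.lower().split()
        total = sum(self.label_counts.values())
        log_scores = {}
        for label in self.label_counts:
            # log prior + log likelihood with add-one smoothing
            logp = math.log(self.label_counts[label] / total)
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for t in tokens:
                logp += math.log((self.token_counts[label][t] + 1) / denom)
            log_scores[label] = logp
        # convert log scores to normalized confidences
        m = max(log_scores.values())
        exps = {lb: math.exp(s - m) for lb, s in log_scores.items()}
        z = sum(exps.values())
        return sorted(((lb, e / z) for lb, e in exps.items()),
                      key=lambda kv: -kv[1])


# Hypothetical training data and query value:
nb = NaiveBayesLearner()
nb.train([("Kent, WA", "address"), ("Seattle, WA", "address"),
          ("Gail Murphy", "name")])
ranked = nb.predict("Renton, WA")
```

Because "WA" appears only in address-labeled examples, the learner assigns most of the probability mass to address, mirroring the slide's (address, 0.8), (name, 0.2)-style output.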
Slide 19
Training the Base Learners
[Figure: training examples from the schema and data of realestate.com, labeled with mediated-schema elements (address, price, agent-phone, description), are fed to the Name Learner and the Naive Bayes Learner.]
Slide 21
Meta-Learner: Stacking
Training the meta-learner produces a weight for every pair of:
– (base learner, mediated-schema element), e.g.,
– weight(Name-Learner, address) = 0.1
– weight(Naive-Bayes, address) = 0.9
To combine predictions, the meta-learner:
– computes a weighted sum of the base learners' confidence scores
– e.g., (address, 0.6 * 0.1 + 0.8 * 0.9 = 0.78)
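The combination step above can be sketched in a few lines; the learner names and weight values below simply reproduce the slide's example and are not LSD's real identifiers.

```python
def combine_predictions(weights, predictions):
    """Stacking combination step: for each candidate mediated-schema
    element, sum the base learners' confidences weighted by the learned
    (learner, element) weights; return elements ranked by score."""
    scores = {}
    for learner, per_element in predictions.items():
        for element, conf in per_element.items():
            w = weights.get((learner, element), 0.0)
            scores[element] = scores.get(element, 0.0) + w * conf
    return sorted(scores.items(), key=lambda kv: -kv[1])


# The slide's numbers: two base learners both predicting "address".
weights = {("name-learner", "address"): 0.1,
           ("naive-bayes", "address"): 0.9}
predictions = {"name-learner": {"address": 0.6},
               "naive-bayes": {"address": 0.8}}
ranked = combine_predictions(weights, predictions)
# combined score for address: 0.6 * 0.1 + 0.8 * 0.9 = 0.78
```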
Slide 22
Training the Meta-Learner
Least-squares linear regression of the true predictions on the base learners' confidence scores yields, e.g.:
– weight(Name-Learner, address) = 0.1
– weight(Naive-Bayes, address) = 0.9
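A sketch of that regression for the two-learner case, solving the 2x2 normal equations directly; the training triples are made up so that the fit recovers the weights on the slide.

```python
def train_meta_weights(X, y):
    """Least-squares fit of per-learner weights for one mediated-schema
    element: minimize sum_i (w1*c1_i + w2*c2_i - y_i)^2 by solving the
    normal equations (X^T X) w = X^T y for two learners."""
    xtx11 = xtx12 = xtx22 = xty1 = xty2 = 0.0
    for (c1, c2), t in zip(X, y):
        xtx11 += c1 * c1
        xtx12 += c1 * c2
        xtx22 += c2 * c2
        xty1 += c1 * t
        xty2 += c2 * t
    det = xtx11 * xtx22 - xtx12 * xtx12
    w1 = (xtx22 * xty1 - xtx12 * xty2) / det
    w2 = (xtx11 * xty2 - xtx12 * xty1) / det
    return w1, w2


# Hypothetical training data: (name-learner conf, naive-bayes conf) per
# column, with y close to 1.0 when the column truly is "address".
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
y = [0.1, 0.9, 1.0]
w_name, w_nb = train_meta_weights(X, y)
```

With this data the exact solution is w_name = 0.1, w_nb = 0.9, matching the slide's weights.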
Trang 23<extra-info> Beautiful yard </>
Meta-Learner
( address ,0.8), ( description ,0.2) ( address ,0.6), ( description ,0.4) ( address ,0.7), ( description ,0.3)
( description ,0.8), ( address ,0.2)
Meta-Learner
Name Learner Naive Bayes
( address ,0.7), ( description ,0.3)
( agent-phone ,0.9), ( description ,0.1)
address price agent-phone description
Schema of homes.com Mediated schema
area day-phone extra-info
Slide 24
The Constraint Handler
Extends learning to incorporate constraints
– e.g., a = agent-phone & b = agent-name => a & b are usually close to each other
– user feedback = hard or soft constraints
Details in [Doan et al., SIGMOD 2001]
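One simple way to realize a hard-constraint handler is to search over combinations of column labels and keep the best combination that satisfies every constraint. The sketch below assumes a combination's score is the product of its element confidences; the column names, predictions, and the single constraint are invented for illustration.

```python
from itertools import product


def best_mapping(predictions, hard_constraints):
    """Pick the highest-scoring combination of labels, one per column,
    that satisfies every hard constraint.

    predictions: {column: [(label, confidence), ...]}
    hard_constraints: list of functions {column: label} -> bool
    Score of a combination = product of its confidences.
    """
    columns = list(predictions)
    best, best_score = None, -1.0
    for choice in product(*(predictions[c] for c in columns)):
        assignment = {c: lab for c, (lab, _) in zip(columns, choice)}
        if not all(ok(assignment) for ok in hard_constraints):
            continue
        score = 1.0
        for _, conf in choice:
            score *= conf
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score


# Hypothetical predictions: two columns both lean toward "address",
# but a hard constraint says at most one column may map to it.
preds = {"area": [("address", 0.7), ("description", 0.3)],
         "location": [("address", 0.9), ("description", 0.1)]}
at_most_one_address = lambda a: list(a.values()).count("address") <= 1
mapping, score = best_mapping(preds, [at_most_one_address])
```

The constraint forces the weaker "address" candidate to fall back to its second choice, illustrating how constraints can override individual learners' top predictions.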
Slide 25
The Current LSD System
[Figure: system architecture. A training phase learns from the mediated schema, the source schemas, and data listings; a matching phase proposes mappings, which the Constraint Handler refines using domain constraints and user feedback.]
Slide 26
– Overview of structure mapping
– Data integration and source mappings
– LSD architecture and details
– Current work
Slide 27
Empirical Evaluation
Domains:
– Real Estate I & II, Course Offerings, Faculty Listings
Methodology:
– create the mediated DTD & domain constraints
– choose five sources
– extract & convert data listings into XML (faithful to the schema!)
– mediated DTDs: 14-66 elements; source DTDs: 13-48
Ten runs per experiment; in each run:
– manually provide 1-1 mappings for 3 sources
– ask LSD to propose mappings for the remaining 2 sources
– accuracy = % of 1-1 mappings correctly identified
Slide 29
Sensitivity to Amount of Available Data
[Chart: matching accuracy as the amount of available data varies.]
Slide 30
Contribution of Schema vs. Data
[Chart: accuracy on Real Estate I, Real Estate II, Course Offerings, and Faculty Listings for LSD with only schema info vs. LSD with only data info.]
Slide 31
Reasons for Incorrect Matching
Unfamiliarity
– e.g., suburb
– solution: add a suburb-name recognizer
Insufficient information
– correctly identified the general type, but failed to pinpoint the exact type
– e.g., <agent-name> Richard Smith </> <phone> (206) 234 5412 </>
– solution: add a proximity learner
Subjectivity
– house-style = description?
Slide 32
– Overview of structure mapping
– Data integration and source mappings
– LSD architecture and details
– Experimental results
Slide 33
Moving Up the Expressiveness Ladder
Schemas are very simple ontologies
More expressive power = more domain constraints
– mappings become more complex, but constraints provide more to learn from
– F1(A1, …, Am) = F2(B1, …, Bm)
Ontologies (of various flavors):
– class hierarchy (i.e., containment on unary relations)
– relationships between objects
– constraints on relationships
Slide 34
Finding Non 1-1 Mappings (current work)
Given two schemas, find:
– 1-many mappings: address = concat(city, state)
– many-1 mappings: half-baths + full-baths = num-baths
– many-many mappings: concat(addr-line1, addr-line2) = concat(street, city, state)
Each mapping is expressed as a query:
– value correspondence expression: room-rate = rate * (1 + tax-rate)
– relationship: the state of tax-rate = the state of the hotel that has rate
Special case: 1-many mappings between two relational tables
[Figure: a source schema (city, state, comments, half-baths, full-baths) mapped to a mediated schema (address, description, num-baths).]
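The correspondences in the figure, viewed as a query over the source, can be sketched as a per-row transformation. The attribute names follow the figure; the separator in the concat and the comments-to-description pairing are illustrative assumptions.

```python
def to_mediated(row):
    """Map one source row (city, state, comments, half-baths,
    full-baths) to the mediated schema (address, description,
    num-baths) using the slide's non-1-1 correspondences."""
    return {
        # 1-many: one mediated attribute from two source attributes
        "address": f'{row["city"]}, {row["state"]}',
        # many-1: two source attributes summed into one
        "num-baths": row["half-baths"] + row["full-baths"],
        # 1-1: direct correspondence (assumed pairing)
        "description": row["comments"],
    }


example = to_mediated({"city": "Miami", "state": "FL",
                       "half-baths": 1, "full-baths": 2,
                       "comments": "Great location"})
```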
Slide 35
For each set of mediated-schema columns:
– enumerate all possible mappings
– evaluate & return the best mapping
[Figure: source-schema columns matched against mediated-schema columns; similarity is computed using all base learners.]
Slide 36
Search-Based Solution
States = columns
– goal state: a mediated-schema column
– initial states: all source-schema columns
– use 1-1 matching to reduce the set of initial states
Operators: concat, +, -, *, /, etc.
Column similarity:
– use all base learners + recognizers
Slide 37
Multi-Strategy Search
Use a set of expert modules, L1, L2, …, Ln; each module
– applies only to certain types of mediated-schema columns
– searches a small subspace
– uses a cheap similarity measure to compare columns
Examples:
– L1: text; concat; TF/IDF
– L2: numeric; +, -, *, /; [Ho et al., 2000]
– L3: address; concat; Naive Bayes
Search techniques:
– beam search as the default
– specialized searches that do not have to materialize columns
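A toy version of such a search, assuming concat is the only operator and using token-level Jaccard overlap as a stand-in for the cheap similarity measure; the column names and data values are invented.

```python
def beam_search_mapping(source_cols, target_vals, beam_width=3, max_cols=3):
    """Beam search for a concatenation of source columns that best
    matches a target (mediated-schema) column.

    source_cols: {column name: list of values}
    target_vals: list of values of the target column
    Returns (similarity, tuple of source column names).
    """
    def tokens(vals):
        return {t for v in vals for t in v.lower().split()}

    target_tokens = tokens(target_vals)

    def concat(cols):
        # materialize the candidate column by row-wise concatenation
        return [" ".join(parts)
                for parts in zip(*(source_cols[c] for c in cols))]

    def sim(cols):
        cand = tokens(concat(cols))
        union = cand | target_tokens
        return len(cand & target_tokens) / len(union) if union else 0.0

    # start from single columns, keep only the top-k states per level
    beam = sorted(((sim((c,)), (c,)) for c in source_cols),
                  reverse=True)[:beam_width]
    best = beam[0]
    for _ in range(max_cols - 1):
        expanded = [(sim(cols + (c,)), cols + (c,))
                    for _, cols in beam
                    for c in source_cols if c not in cols]
        if not expanded:
            break
        beam = sorted(expanded, reverse=True)[:beam_width]
        if beam[0][0] > best[0]:
            best = beam[0]
    return best


source = {"city": ["Miami", "Boston"],
          "state": ["FL", "MA"],
          "price": ["$250,000", "$110,000"]}
target = ["Miami FL", "Boston MA"]  # a mediated "address"-like column
score, cols = beam_search_mapping(source, target)
```

On this data the search discovers that concatenating city and state reproduces exactly the target's tokens, so the best state scores a similarity of 1.0.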
Slide 38
Multi-Strategy Search (cont'd)
Combine modules’ predictions & select the best one
Slide 39
Related Work
– TRANSCM [Milo & Zohar, 1998], ARTEMIS [Castano & Antonellis, 1999], [Palopoli et al., 1998], CUPID [Madhavan et al., 2001]
– SEMINT [Li & Clifton, 1994], ILA [Perkowitz & Etzioni, 1995], DELTA [Clifton et al., 1997]
– CLIO [Miller et al., 2000], [Yan et al., 2001]
– LSD [Doan et al., 2000, 2001]
Dimensions along which these systems differ: Single Learner + 1-1 Matching; Hybrid + 1-1 Matching; Recognizers + Schema + 1-1 Matching; Sophisticated Data-Driven User Interaction; Schema + Data, 1-1 + non-1-1 Matching
Slide 40
LSD:
– applies multi-strategy learning to generate semantic mappings
– is extensible, exploiting background knowledge and previous techniques
– experimental results are promising
Slide 41
Backup Slides
Slide 42
[Figure: mappings m1, m3 between schemas.]
Ten months later:
– are the mappings still correct?
Slide 43
Information Extraction from Text
Extract data fragments from text documents
– e.g., the date, location, & victim's name from a news article
Intensive research on free-text documents
Many documents do have substantial structure
– XML pages, name cards, tables, lists
– the structure forms a schema
– only one data value per schema element, whereas a "real" data source has many data values per schema element
Ongoing research in the IE community
Slide 44
Contribution of Each Component
[Chart: accuracy contribution of each LSD component.]
Slide 45
Exploiting Hierarchical Structure
Existing learners flatten out all structures
The XML learner is similar to the Naive Bayes learner
– input instance = bag of tokens
– but differs in one crucial aspect: it considers not only text tokens, but also structure tokens
Example instance:
< description > Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors. </ description >
< contact >
< name > Gail Murphy </ name >
< firm > MAX Realtors </ firm >
</ contact >
Slide 46
Domain Constraints
Impose semantic regularities on sources
– verified using schema or data
– a = address & b = address => a = b
– a = house-id => a is a key
– a = agent-info & b = agent-name => b is nested in a
Can be specified up front
– when creating the mediated schema
– independent of any actual source schema
Trang 47area : ( address ,0.7), ( description ,0.3)
The Constraint Handler
Can specify arbitrary constraints
User feedback = domain constraint
Extended to handle domain heuristics
– a = agent-phone & b = agent-name a & b are usually close to each other
0.3 0.1 0.4 0.012
0.7 0.9 0.6 0.378
0.7 0.9 0.4 0.252
Domain Constraints
a = address & b = adderss a = b
Predictions from Meta-Learner