
learning to map between schemas ontologies


Trang 1

Alon Halevy University of Washington

Joint work with Anhai Doan and Pedro Domingos

Learning to Map Between Schemas and Ontologies

Trang 2

– Solution that uses multi-strategy learning.

– We’ve started with schema matching (i.e., very simple ontologies)

– Currently extending to more expressive ontologies.

– Experiments show the approach is very promising!

Trang 3

The Structure Mapping Problem

• Types of structures:
– Database schemas, XML DTDs, ontologies, …
• Input:
– Two (or more) structures, S1 and S2
– Data instances for S1 and S2
– Background knowledge
• Output:
– A mapping between S1 and S2
– Should enable translating between data instances
– Semantics of mapping?

Trang 4

Semantic Mappings between Schemas

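[Figure: two schema trees connected by mapping edges. One tree contains house, location, contact, name, phone, full-baths, and half-baths; the other contains house, address, contact-info, agent-name, agent-phone, and num-baths. Edges are marked as either 1-1 mappings or non 1-1 mappings.]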

Trang 5

• Database schema integration
– A problem as old as databases themselves
– Database merging, data warehouses, data migration
– On the WWW, in enterprises, large science projects
– Model matching: key operator in an algebra where models and mappings are first-class objects
– See [Bernstein et al., 2000] for more
• The Semantic Web
– Ontology mapping
• System interoperability
– E-services, application integration, B2B applications, …

Trang 6

Desiderata from Proposed Solutions

• Accuracy, efficiency, ease of use
• Realistic expectations:
– Unlikely to be fully automated. Need user in the loop.
• Some notion of semantics for mappings
• Extensibility:
– Solution should exploit additional background knowledge.
• “Memory”, knowledge reuse:
– System should exploit previous manually or automatically generated matchings.
– Key idea behind LSD.

Trang 7

LSD Overview

• L(earning) S(ource) D(escriptions)
– […] mediated schema and a large set of data source schemas.
– […] and learn from them to generate the rest.

Trang 8

• Overview of structure mapping
• LSD architecture and details
• Experimental results
• Current work

Trang 9

Data Integration

[Figure: data integration architecture. A user query ("Find houses with four bathrooms priced under $500,000") is posed against the mediated schema. Query reformulation and optimization translates it to source schemas 1, 2, and 3, which are reached through wrappers around homes.com, realestate.com, and homeseekers.com.]

Applications: WWW, enterprises, science projects
Techniques: virtual data integration, warehousing, custom code

Trang 10

Semantic Mappings between Schemas

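[Figure: two schema trees connected by mapping edges. One tree contains house, location, contact, name, phone, full-baths, and half-baths; the other contains house, address, contact-info, agent-name, agent-phone, and num-baths. Edges are marked as either 1-1 mappings or non 1-1 mappings.]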

Trang 11

– W also includes the unmatched attributes of R and S.
– […] universal relation W, and the mappings specify the projection variables and correspondences.

Trang 12

Why Matching is Difficult

• Aims to identify same real-world entity
– using names, structures, types, data values, etc.
• Schemas represent same entity differently
– different names => same entity:
  – area & address => location
– same names => different entities:
  – area => location or square-feet
• Schema & data never fully capture semantics!
– not adequately documented, not sufficiently expressive
• Intended semantics is typically subjective!
– IBM Almaden Lab = IBM?
• Cannot be fully automated. Often hard for humans. Committees are required!

Trang 13

Current State of Affairs

– largely done by hand

– labor intensive & error prone

– GTE: 4 hours/element for 27,000 elements [Li&Clifton00]

• Will only be exacerbated

– data sharing & XML become pervasive

– proliferation of DTDs

– translation of legacy data

– reconciling ontologies on semantic web

Trang 14

• Overview of structure mapping
• Data integration and source mappings
• Experimental results
• Current work

Trang 15

The LSD Approach

• User manually maps a few data sources to the mediated schema
• LSD learns from the mappings, and proposes mappings for the rest of the sources
• Several types of knowledge are used in learning:
– Schema elements, e.g., attribute names
– Data elements: ranges, formats, word frequencies, value frequencies, length of texts
– Proximity of attributes
– Functional dependencies, number of attribute occurrences
• One learner does not fit all. Use multiple learners and combine them with a meta-learner.

Trang 16

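Example

Mediated schema: address, price, agent-phone, description

[Figure: sample listings from two sources, shown as columns of data values under their source element names.
– First source: location (Miami, FL; Boston, MA), a price column ($250,000; $110,000), a phone column ((305) 729 0831; (617) 253 1429), and a text column (Fantastic house; Great location)
– homes.com: a price column ($550,000; $320,000), contact-phone ((278) 345 7215; (617) 335 2315), and a text column (Beautiful yard; Great beach)]

Learned hypotheses:
– If “phone” occurs in the name => agent-phone
– […] in data values => description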

Trang 17

Multi-Strategy Learning

• Use a set of base learners:
– Name learner, Naive Bayes, Whirl, XML learner
• And a set of recognizers:
– County name, zip code, phone numbers
• Each base learner produces a prediction weighted by confidence score
• Combine base learners with a meta-learner, using stacking
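The idea of recognizers and name-based base learners that return confidence-weighted predictions can be sketched as follows. This is only an illustration: the regular expressions, the function names, and the 0.9/0.1 scores are assumptions, since the slides name the components but do not say how they are implemented.

```python
import re
from typing import Dict

# Hypothetical patterns: the slides name the recognizers but not their rules.
ZIP_RE = re.compile(r"\d{5}(-\d{4})?")
PHONE_RE = re.compile(r"\(\d{3}\) \d{3} \d{4}")  # format of the sample listings


def zip_recognizer(value: str) -> float:
    """Confidence (1.0 or 0.0) that a data value is a US zip code."""
    return 1.0 if ZIP_RE.fullmatch(value.strip()) else 0.0


def phone_recognizer(value: str) -> float:
    """Confidence (1.0 or 0.0) that a data value is a phone number."""
    return 1.0 if PHONE_RE.fullmatch(value.strip()) else 0.0


def name_learner(element_name: str) -> Dict[str, float]:
    """Toy name-based learner: maps a source element name to
    (mediated-schema element -> confidence). The single rule mirrors the
    hypothesis 'if "phone" occurs in the name => agent-phone'; the
    confidence values are made up for illustration."""
    if "phone" in element_name.lower():
        return {"agent-phone": 0.9, "description": 0.1}
    return {}  # no opinion about this element


print(phone_recognizer("(305) 729 0831"))
print(name_learner("contact-phone"))
```

Each component returns scores rather than a hard decision, which is what lets a meta-learner weigh them against each other afterwards.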

Trang 18

• Naive Bayes Learner [Domingos & Pazzani 97]
– “Kent, WA” => (address, 0.8), (name, 0.2)
• Whirl Learner [Cohen & Hirsh 98]
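A minimal sketch of how a Naive Bayes learner could turn a data value such as "Kent, WA" into confidence scores over mediated-schema elements. The tokenizer, the Laplace smoothing, and the training values are assumptions made for this example, not details taken from the talk.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase word and number tokens (an assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesLearner:
    def __init__(self) -> None:
        self.token_counts = defaultdict(Counter)  # label -> token frequencies
        self.label_counts = Counter()             # label -> training values seen
        self.vocab = set()

    def train(self, examples: List[Tuple[str, str]]) -> None:
        """examples: (data value, mediated-schema element) pairs."""
        for value, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(value):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, value: str) -> Dict[str, float]:
        """Return normalized posterior scores, one per label."""
        tokens = tokenize(value)
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokens:
                # Laplace smoothing so unseen tokens do not zero the score.
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        unnorm = {lab: math.exp(s - top) for lab, s in log_scores.items()}
        z = sum(unnorm.values())
        return {lab: v / z for lab, v in unnorm.items()}


learner = NaiveBayesLearner()
learner.train([
    ("Miami, FL", "address"), ("Boston, MA", "address"),
    ("Seattle, WA", "address"),
    ("Gail Murphy", "name"), ("Richard Smith", "name"),
])
print(learner.predict("Kent, WA"))  # address scores higher than name
```

The exact scores depend on the training data, so this toy model is not expected to reproduce the 0.8 / 0.2 split shown on the slide.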

Trang 19

Training the Base Learners

[Figure: training the Name Learner and the Naive Bayes Learner on realestate.com, using the mappings from the schema of realestate.com to the mediated schema (address, price, agent-phone, description).]

Trang 21

Meta-Learner: Stacking

• Training of meta-learner produces a weight for every pair of:
– (base-learner, mediated-schema element)
– weight(Name-Learner, address) = 0.1
– weight(Naive-Bayes, address) = 0.9
• Combining predictions of meta-learner:
– computes weighted sum of base-learner confidence scores
– Meta-Learner: (address, 0.6*0.1 + 0.8*0.9 = 0.78)
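The weighted sum on this slide can be written out directly. The sketch below uses the slide's weights and confidence scores; the dictionary layout and the `combine` function name are assumptions for illustration.

```python
from typing import Dict, Tuple

# Weights per (base learner, mediated-schema element), as given on the slide.
weights: Dict[Tuple[str, str], float] = {
    ("Name-Learner", "address"): 0.1,
    ("Naive-Bayes", "address"): 0.9,
}


def combine(predictions: Dict[str, Dict[str, float]],
            weights: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """Weighted sum of base-learner confidence scores for each label.
    Pairs without a learned weight contribute nothing."""
    combined: Dict[str, float] = {}
    for learner, scores in predictions.items():
        for label, confidence in scores.items():
            w = weights.get((learner, label), 0.0)
            combined[label] = combined.get(label, 0.0) + w * confidence
    return combined


# Base-learner confidences from the slide: 0.6 (Name Learner), 0.8 (Naive Bayes).
predictions = {
    "Name-Learner": {"address": 0.6},
    "Naive-Bayes": {"address": 0.8},
}
result = combine(predictions, weights)
print(round(result["address"], 2))  # 0.6*0.1 + 0.8*0.9, i.e. 0.78
```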

Trang 22

Training the Meta-Learner

[Figure: least-squares linear regression over a table of training instances, with columns that include the Naive Bayes scores and the true predictions.]

Weight(Name-Learner, address) = 0.1
Weight(Naive-Bayes, address) = 0.9
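A sketch of how least-squares regression could produce the two weights for one mediated-schema element. It solves the 2x2 normal equations in plain Python with no intercept term; the function name, the absence of an intercept, and the training rows are assumptions, since the slide's table did not survive extraction.

```python
from typing import List, Tuple


def fit_two_weights(rows: List[Tuple[float, float, float]]) -> Tuple[float, float]:
    """Least-squares fit of y ~ w1*x1 + w2*x2 (no intercept).

    Each row is (name_learner_score, naive_bayes_score, truth), where truth
    is 1.0 if the instance really belongs to the element and 0.0 otherwise.
    """
    a11 = sum(x1 * x1 for x1, _, _ in rows)
    a12 = sum(x1 * x2 for x1, x2, _ in rows)
    a22 = sum(x2 * x2 for _, x2, _ in rows)
    b1 = sum(x1 * y for x1, _, y in rows)
    b2 = sum(x2 * y for _, x2, y in rows)
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("learner scores are collinear; weights are not unique")
    # Cramer's rule on the normal equations.
    w1 = (b1 * a22 - a12 * b2) / det
    w2 = (a11 * b2 - a12 * b1) / det
    return w1, w2


# Made-up training rows for the element "address".
training = [
    (0.6, 0.8, 1.0),
    (0.5, 0.9, 1.0),
    (0.7, 0.1, 0.0),
    (0.4, 0.2, 0.0),
]
w_name, w_bayes = fit_two_weights(training)
print(w_name, w_bayes)
```

In this toy data the Naive Bayes score tracks the truth more closely than the Name Learner score, so it receives the larger weight, which is the qualitative pattern the slide reports (0.9 versus 0.1).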

Trang 23

[Figure: applying the Name Learner, Naive Bayes, and the Meta-Learner to match the schema of homes.com against the mediated schema.]

Schema of homes.com: area, day-phone, extra-info
Mediated schema: address, price, agent-phone, description
Sample instance: <extra-info> Beautiful yard </>

Predictions shown in the figure:
– (address, 0.8), (description, 0.2)
– (address, 0.6), (description, 0.4)
– (address, 0.7), (description, 0.3)
– (description, 0.8), (address, 0.2)
– (address, 0.7), (description, 0.3)
– (agent-phone, 0.9), (description, 0.1)

Trang 24

The Constraint Handler

• Extends learning to incorporate constraints
– a = agent-phone & b = agent-name => a & b are usually close to each other
– user feedback = hard or soft constraints
• Details in [Doan et al., SIGMOD 2001]

Trang 25

The Current LSD System

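[Figure: architecture of the current LSD system, split into a training phase and a matching phase. Inputs: mediated schema, source schemas, data listings. The constraint handler takes domain constraints and user feedback and outputs mappings.]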

Trang 26

• Overview of structure mapping
• Data integration and source mappings
• LSD architecture and details
• Current work

Trang 27

Empirical Evaluation

– Real Estate I & II, Course Offerings, Faculty Listings

– create mediated DTD & domain constraints

– choose five sources

– extract & convert data listings into XML (faithful to schema!)

– mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48

• Ten runs for each experiment. In each run:

– manually provide 1-1 mappings for 3 sources

– ask LSD to propose mappings for remaining 2 sources

– accuracy = % of 1-1 mappings correctly identified

Trang 29

Sensitivity to Amount of Available Data

Trang 30

Contribution of Schema vs Data

[Figure: chart over the four domains (Real Estate I, Real Estate II, Course Offerings, Faculty Listings) comparing LSD with only schema info against LSD with only data info.]

Trang 31

Reasons for Incorrect Matching

• Unfamiliarity
– suburb
– solution: add a suburb-name recognizer
• Insufficient information
– correctly identified general type, failed to pinpoint exact type
– <agent-name>Richard Smith</> <phone> (206) 234 5412 </>
– solution: add a proximity learner
• Subjectivity
– house-style = description?

Trang 32

• Overview of structure mapping
• Data integration and source mappings
• LSD architecture and details
• Experimental results

Trang 33

Moving Up the Expressiveness Ladder

• Schemas are very simple ontologies
• More expressive power = more domain constraints
• Mappings become more complex, but constraints provide more to learn from
– F1(A1,…,Am) = F2(B1,…,Bm)
• Ontologies (of various flavors):
– Class hierarchy (i.e., containment on unary relations)
– Relationships between objects
– Constraints on relationships

Trang 34

Finding Non 1-1 Mappings (Current work)

• Given two schemas, find
– 1-many mappings: address = concat(city, state)
– many-1: half-baths + full-baths = num-baths
– many-many: concat(addr-line1, addr-line2) = concat(street, city, state)
– expressed as query
– value correspondence expression: room-rate = rate * (1 + tax-rate)
– relationship: state of tax-rate = state of hotel that has rate
– special case: 1-many mappings between two relational tables

[Figure: source schema (city, state, comments, half-baths, full-baths) matched against mediated schema (address, description, num-baths).]
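To make the non 1-1 mappings above concrete, here is a small sketch that applies them to one source row. The column names come from the slide; the `", "` separator used for concat, the comments-to-description pairing, and the sample row are assumptions for illustration.

```python
from typing import Callable, Dict

Row = Dict[str, object]

# Each mediated-schema column is computed from one or more source columns.
mappings: Dict[str, Callable[[Row], object]] = {
    # address = concat(city, state); the separator is an assumption
    "address": lambda r: f"{r['city']}, {r['state']}",
    # half-baths + full-baths = num-baths
    "num-baths": lambda r: r["half-baths"] + r["full-baths"],
    # assumed plain 1-1 mapping, inferred from the figure
    "description": lambda r: r["comments"],
}


def translate(row: Row) -> Row:
    """Translate one source-schema row into the mediated schema."""
    return {target: fn(row) for target, fn in mappings.items()}


source_row = {
    "city": "Miami", "state": "FL", "comments": "Fantastic house",
    "half-baths": 1, "full-baths": 2,
}
print(translate(source_row))
```

Representing each mapping as an expression over source columns is what allows data instances to be translated, which the earlier slides list as a requirement on the output.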

Trang 35

• For each set of mediated-schema columns
– enumerate all possible mappings
– evaluate & return best mapping

[Figure: source-schema columns are compared with mediated-schema columns; similarity is computed using all base learners.]

Trang 36

Search-Based Solution

• States = columns
– goal state: mediated-schema column
– initial states: all source-schema columns
– use 1-1 matching to reduce the set of initial states
• Operators: concat, +, -, *, /, etc.
• Column-similarity:
– use all base learners + recognizers
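The search formulation above can be sketched as a small beam search over numeric columns. Everything beyond the states/operators framing is an assumption: the similarity function here is a toy row-aligned measure (the slide says LSD uses its base learners and recognizers instead), division is left out to avoid zero divisors, and the function names and parameters are hypothetical.

```python
import operator
from typing import Dict, List, Tuple

Column = List[float]
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def similarity(candidate: Column, target: Column) -> float:
    """Toy similarity: negative mean absolute difference, assuming the
    candidate and target columns have aligned rows."""
    return -sum(abs(c - t) for c, t in zip(candidate, target)) / len(target)


def beam_search(sources: Dict[str, Column], target: Column,
                beam_width: int = 3, max_depth: int = 2) -> Tuple[str, float]:
    """Search for an expression over source columns that best matches target."""
    # Initial states: every source column on its own.
    beam = list(sources.items())
    best_expr, best_score = max(
        ((expr, similarity(col, target)) for expr, col in beam),
        key=lambda pair: pair[1],
    )
    for _ in range(max_depth - 1):
        expanded = []
        # Apply every operator between each beam state and each source column.
        for expr, col in beam:
            for name, src in sources.items():
                for symbol, fn in OPS.items():
                    new_col = [fn(a, b) for a, b in zip(col, src)]
                    expanded.append((f"({expr} {symbol} {name})", new_col))
        expanded.sort(key=lambda ec: similarity(ec[1], target), reverse=True)
        beam = expanded[:beam_width]  # keep only the most promising states
        if not beam:
            break
        top_expr, top_col = beam[0]
        top_score = similarity(top_col, target)
        if top_score > best_score:
            best_expr, best_score = top_expr, top_score
    return best_expr, best_score


sources = {"half-baths": [1, 0, 2], "full-baths": [2, 1, 3]}
num_baths = [3, 1, 5]  # goal state: the mediated-schema column
print(beam_search(sources, num_baths))
```

Keeping only `beam_width` states per level is what bounds the cost; an exhaustive enumeration would grow with every operator and column at each depth.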

Trang 37

Multi-Strategy Search

• Use a set of expert modules: L1, L2, …, Ln
• Each module:
– applies to only certain types of mediated-schema column
– searches a small subspace
– uses a cheap similarity measure to compare columns
– L1: text; concat; TF/IDF
– L2: numeric; +, -, *, /; [Ho et al 2000]
– L3: address; concat; Naive Bayes
• Search techniques
– beam search as default
– specialized, do not have to materialize columns

Trang 38

Multi-Strategy Search (cont’d)

• Combine modules’ predictions & select the best one

Trang 39

Related Work

TRANSCM [Milo&Zohar98]

ARTEMIS [Castano&Antonellis99]

[Palopoli et al 98]

CUPID [Madhavan et al 01]

SEMINT [Li&Clifton94]

ILA [Perkowitz&Etzioni95]

DELTA [Clifton et al 97]


LSD [Doan et al 2000, 2001] CLIO [Miller et al 00],[Yan et al 01]

Single Learner + 1-1 Matching

Hybrid + 1-1 Matching

Schema + Data 1-1 + non 1-1 Matching

Sophisticated Data-Driven User Interaction

Recognizers + Schema + 1-1 Matching

Trang 40

• LSD:
– […] generate semantic mappings.
– […] knowledge, and previous techniques.
– […] promising.

Trang 41

Backup Slides

Trang 42

[Figure: mapping labels m1 and m3 are visible.]

• Ten months later

– are the mappings still correct?

Trang 43

Information Extraction from Text

• Extract data fragments from text documents
– date, location, & victim’s name from a news article
• Intensive research on free-text documents
• Many documents do have substantial structure
– XML pages, name card, tables, list
– structure forms a schema
– only one data value per schema element
– “real” data source has many data values per schema element
• Ongoing research in the IE community

Trang 44

Contribution of Each Component

Trang 45

Exploiting Hierarchical Structure

• Existing learners flatten out all structures
– similar to the Naive Bayes learner
– input instance = bag of tokens
– differs in one crucial aspect
– considers not only text tokens, but also structure tokens

<description> Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors. </description>
<contact>
<name> Gail Murphy </name>
<firm> MAX Realtors </firm>
</contact>

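One way to picture "structure tokens" is to emit a token for each nested tag alongside the ordinary text tokens, so the bag of tokens keeps some of the hierarchy. The sketch below uses Python's standard `xml.etree.ElementTree`; the token format (`<name>`) and the function name are assumptions, since the slide does not say how structure tokens are encoded.

```python
import re
import xml.etree.ElementTree as ET
from typing import List


def structure_aware_tokens(xml_text: str) -> List[str]:
    """Return text tokens plus one structure token per nested element."""
    root = ET.fromstring(xml_text)
    tokens: List[str] = []

    def visit(node: ET.Element, is_root: bool) -> None:
        if not is_root:
            tokens.append(f"<{node.tag}>")  # structure token
        if node.text:
            tokens.extend(re.findall(r"\w+", node.text.lower()))
        for child in node:
            visit(child, False)
            if child.tail:  # text that follows the child inside this node
                tokens.extend(re.findall(r"\w+", child.tail.lower()))

    visit(root, True)
    return tokens


contact = "<contact><name>Gail Murphy</name><firm>MAX Realtors</firm></contact>"
print(structure_aware_tokens(contact))
```

A flat bag of words would give the same tokens for this contact element as for a description that merely mentions "Gail Murphy" and "MAX Realtors"; the extra `<name>` and `<firm>` tokens are what let a learner tell the two apart.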
Trang 46

Domain Constraints

• Impose semantic regularities on sources
– verified using schema or data
– a = address & b = address => a = b
– a = house-id => a is a key
– a = agent-info & b = agent-name => b is nested in a
• Can be specified up front
– when creating mediated schema
– independent of any actual source schema

Trang 47

The Constraint Handler

• Can specify arbitrary constraints
• User feedback = domain constraint
• Extended to handle domain heuristics
– a = agent-phone & b = agent-name => a & b are usually close to each other

Predictions from Meta-Learner
area: (address, 0.7), (description, 0.3)

Domain Constraints
a = address & b = address => a = b

[Figure: candidate mapping combinations, each scored as the product of three confidences]
0.3 × 0.1 × 0.4 = 0.012
0.7 × 0.9 × 0.6 = 0.378
0.7 × 0.9 × 0.4 = 0.252
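The products on this slide suggest scoring each combination of per-element predictions and discarding combinations that violate a domain constraint. The sketch below does this by exhaustive enumeration, which is an assumption: the slides do not say how the handler searches. The element names day-phone and extra-info, and the assignment of the 0.6 / 0.4 scores to extra-info, are also assumed so that the three products can be reproduced.

```python
from itertools import product
from math import prod
from typing import Callable, Dict, List, Optional, Tuple

Assignment = Dict[str, str]  # source element -> mediated-schema element

# Meta-learner predictions; only the "area" row is stated on the slide.
predictions: Dict[str, Dict[str, float]] = {
    "area": {"address": 0.7, "description": 0.3},
    "day-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}


def at_most_one_address(assignment: Assignment) -> bool:
    """Domain constraint: a = address & b = address => a = b."""
    return list(assignment.values()).count("address") <= 1


def best_mapping(
    predictions: Dict[str, Dict[str, float]],
    constraints: List[Callable[[Assignment], bool]],
) -> Tuple[Optional[Assignment], float]:
    """Return the highest-scoring assignment that satisfies all constraints."""
    elements = list(predictions)
    best, best_score = None, 0.0
    for labels in product(*(predictions[e] for e in elements)):
        assignment = dict(zip(elements, labels))
        if not all(check(assignment) for check in constraints):
            continue  # violates a domain constraint
        score = prod(predictions[e][lab] for e, lab in assignment.items())
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score


mapping, score = best_mapping(predictions, [at_most_one_address])
print(mapping, round(score, 3))
```

Without the constraint, the top combination is the 0.378 one that maps two elements to address; with it, the 0.252 combination wins.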

Posted: 24/10/2014, 12:31
