
Document: Mining Database Structure; Or, How to Build a Data Quality Browser (docx)


DOCUMENT INFORMATION

Basic information

Title: Mining database structure; or, how to build a data quality browser
Authors: Tamraparni Dasu, Theodore Johnson, S. Muthukrishnan, Vladislav Shkapenyuk
Institution: AT&T Labs–Research
Field: Databases
Type: Conference paper
Year of publication: 2002
City: Madison, Wisconsin, USA
Format:
Number of pages: 12
File size: 304.46 KB


Mining Database Structure; Or, How to Build a Data Quality Browser

Tamraparni Dasu, Theodore Johnson, S. Muthukrishnan, Vladislav Shkapenyuk

AT&T Labs–Research

ABSTRACT

2-"#/$(1a UFY"6"("6D6'R $#"#UFG SM"(bc: " 79

9: !+->1dA"(1k @FG "#+-/U"(1#%

I]"]'"L6"(A"#2- +./BMQ1Q"#H 7lWm(nYn

+- U$H9:"7"64 "%/% ;"#"Z6'FY f6,+-+-/ ?;6

1 INTRODUCTION

7+-"(y1 F&$"#5+-$#2-;6+-/7+.$# "($(231>"#"("6,6' T+-$( C

;64"(T+3+-"#8u"%/%- :"(ba"(T+.$("#! W$(Q ,"\k 1:"#ƒ\%

6,+-C

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

ACM SIGMOD ’2002 June 4-6, Madison, Wisconsin, USA

Copyright 2002 ACM 1-58113-497-5/02/06 ...$5.00.

$# "#$(231H, T6"#2-+./>$#Q ,"(!6H"#"(+-/,2.2‚+.FY >+- 

+-$>$# Q%NˆV2g+./ @;2-1+-# X6B6P,+-+./]'"HFGQC

"{+3"W$#$#"(5 **23+-2-"f69"#@'"f6+-‹$#23 "(:"#+-A"

+-/dŒ ‘'’ ;5b5"W6+-$#k+-d, "

ˆ[ T6$\+ 869;"#‚ FG"#Z$# T+-8>1f66+3+- 2A 9C

6+3‹,$(23%j•z Q$(+-/d"#+3 1Uu"%/% Z$# : "f$#Q ,"(

,"(T'+- ]+-82-231|;'"d;6P C— FGCy6'"%U„ "#+-/MA"(1 6"(:"#;6"#$#+-"#f";2-231S 7>+-+-"#6P;6O>1U6"#/6" 'A"(+-,"%E•z TA"("#231 i92-"#,>#1L$( +-R;6 $#,"(T"#6

FY "#+-/EA"(1#% € +u%

6+3–"("(Tz92-"#7u"%

:"#6NlWm(nYnopq 5|6'|{;2-+3 1R9 'b@"\%`˜j"#2-2.>= 'T+.6"#

™ ps—pšpr#mf›;tQœQjn

92-"#;$#"## @92."(# W;6E?"#2.6#%

 ?;2-"#8u"%/%- ;T9:"(k F[ b@!+-U92-" ;T9:"(k FV6+-Q+-$\

2-"#k+-08?"#2.6 "($%


"60+.4"#$# 6#%

6'9"%

¢M£

"#k F[¥d%

¢M£

"( /"#"# Ĥ

"($# 6#%

>+./*  2.#%

1.1 Related Work

f92-"ŒŽŽ AŽ «T ¬’%z­4 "X/"#"(2-231  "X$,:"(FY ®p\›:¯

BQ+-/|6+-Q$#"|ŒA« @²T !޳’%a©W,+-"("#Q>+->+.M?;6+-/

+- &%

€!´A€!µ Œ- (;6 €X¶!·

©7^0Œ¸'"!{"(1Q1Q"#,j6"(+./"6

$(+g2-2-1>+-"# 2-+-/*,+-/Z;6Q$\2$# ¹;+-$(XŒT Ž«T T 

"($(i6'9;"5Q$\"fŒ

+-/,H"(T+3"Z69"f ,2-2 b`6',+-+-/{T"(+-"##%

92-"#@Œ

ސGƒ\ &66+3+- ;2_+-FG >+- 

9: !$( "#$(+- @9:"( bz"#"#4?"#2.6@+-X"#"6"6%

 /,# ,2."(i VbkA"#2-"(j$# "(‹,$(+."(TXŒ3 ¸'—%[I]"k"k"#"(2

+-<#"OŒ²’ i;6P"(2."($(+-+3y1]"#Q+->'+ KŒ

ŒŽ'z;6UQ",+-/>bkA"#2-"(Œ3 #²’%

?;"#2.6@+./>2-2,>'+."(#%

$#$#$(1%

6'9;"%

2 SUMMARIZING VALUES OF A FIELD

6'9"X+.>f>2-2;, V F$#"%j©Wj6f,+-+./Z2-/ C

2.1 Set Resemblance

$\rho = |A \cap B| \,/\, |A \cup B|$.

"%

Å#ÇÈZÉ5ÊË ÈÊ-ÌÇÈ7Í[ỴÏ@%„ 7 f: "#Wb5"

$|A \cap B| = \dfrac{\rho}{1 + \rho}\,\bigl(|A| + |B|\bigr)$
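The resemblance of two sets, |A ∩ B| / |A ∪ B|, and the identity that recovers the intersection size from the resemblance and the two set sizes, |A ∩ B| = ρ/(1+ρ) · (|A| + |B|), can be checked with a small example. This is a minimal sketch; the function names `resemblance` and `intersection_size` are illustrative and do not come from the paper.

```python
def resemblance(a, b):
    """Return |A ∩ B| / |A ∪ B| for two collections of field values."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty fields are given resemblance 0.
        return 0.0
    return len(a & b) / len(union)


def intersection_size(rho, size_a, size_b):
    """Recover |A ∩ B| from rho, |A| and |B| via rho / (1 + rho) * (|A| + |B|)."""
    return rho / (1.0 + rho) * (size_a + size_b)


field_a = {1, 2, 3, 4}
field_b = {3, 4, 5, 6, 7, 8}

rho = resemblance(field_a, field_b)  # 2 shared values out of 8 distinct values
approx = intersection_size(rho, len(field_a), len(field_b))
print(rho, approx)  # approx is close to 2, the true intersection size
```

The identity follows from |A ∪ B| = |A| + |B| − |A ∩ B|, so only the resemblance and the two field sizes need to be stored to answer intersection-size questions.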

?;"(2g6  HÀ¼*À;6PÀ½dÀT+.!23b51X+-2g92-"%

T*9:"([ÕD%i“"(Wہ’¼WƒV¿K,+-ÜÝÞz’Ĩ‚u߃ƒ\%

ˆV'ŒÛA’¼fƒV¿NÛA’½ƒy;¿B¾

ÛA’¼fƒ[¿aÛA’½ƒy+-_k˜z"( 2.2-+

,"# '"#"XŒ¸’ 9&b5"i/+-A"5@{+-$f+-+3+-A"V"(2.+- %i•z C

¼8 ;,"#231Z"#2-"#$(jĨ&á&⁒³Aƒ‚+3F:+3V+-[+-¼8 "#2-"kĨ&á&â'Q 'ƒ\ "#2-"kĨ&á‚â'—Žƒ\

6 f &%

$# +.6"(8Ĩ á‚â +-B¼aÃ]½,%

+-;6+.$#"64914ÛA’¼fƒ[¿NÛA’½ƒ\%

 7+3z,"&%

,2-"##% }

’¼Wƒ\ê(ë#ëë(êÛ'ìZ’¼Wƒƒ\ ÜÝÞ ’Ĩ;íu߃ƒ\%_I|"7"#Q+->'"W¾>91

¾¿ ï à:ŒÛíQ’¼Wƒi¿NÛíuĩƒyGÂ(è
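Estimating the resemblance from signatures can be sketched with min-hashing: each signature component is the minimum hash value over the field's distinct values, and the fraction of components on which two signatures agree estimates ρ. The sketch below is an illustration under stated assumptions, not the paper's implementation: the hash family (BLAKE2b keyed by the component index) and the names `signature` and `estimate_resemblance` are hypothetical choices.

```python
import hashlib


def _hash(value, seed):
    """64-bit hash of a value under hash function number `seed` (illustrative family)."""
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def signature(values, n=100):
    """Min-hash signature: component i is the minimum of hash i over the distinct values."""
    distinct = set(values)
    if not distinct:
        raise ValueError("cannot build a signature for an empty field")
    return [min(_hash(v, i) for v in distinct) for i in range(n)]


def estimate_resemblance(sig_a, sig_b):
    """Fraction of signature components on which the two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


sig_a = signature(range(0, 1000), n=400)
sig_b = signature(range(500, 1500), n=400)
print(estimate_resemblance(sig_a, sig_b))  # true resemblance is 500/1500
```

Each component agrees with probability ρ, so the estimate is an average of independent indicator variables and its error shrinks as the signature length grows.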


í $( 2.6L9:"d;+3b@+-"

+-6"#:"#6"#%

xy| "> FX+-/;'"## Vbz"d>A" bz U, T6+3?;$#+- 8FG 

³ê\Žù(âzúE (‚FG W$# TA"#+-"#@+."#/"(

u"%/%8 231S,FY"(bh,+-2-2-+- P6+-Q+-$(72-"#ƒ*Œ¸’%*„ fA"(1U2./"

³TêŽûù úR (_>+-2."609T102- /

޲;/"

‘ T%

"(92.$#" +3X2- $#d"#Q+->"W¾

Þüý

¿ªÀ¼|þj½dÀÂÀ¼|Ã,½dÀT;6

Þüý

¿ªÀ½Kþk¼ÀÂÀ¼MÃ0½dÀT9T1

ÿ Þüý ¿ ï

ígð âñò ñ à:ŒÛ'íQ’¼Wƒ=Û큒½*ƒyYÂ\è

ý&üÞ

ígð âñò ñ à:ŒÛ'íQ’¼Wƒ=Û큒½*ƒyYÂ\è

x—F ¾76ÿ¾î ý‚üÞ "j2./" 9Vÿ¾ Þ&üý +-‚>2-2u b5"i$8$# $#2-;6"

+-

é5’¼MÃd½ƒV¿Øu,+-&—Û

’¼Wƒ\êÛ

’½ƒƒ\ê(ë#ë#ë#ê,+-—Û ì ’¼fƒ\êÛ ì ’½*ƒƒƒ

69;"%

+./H ?;2-+-/ VH"(8+-/;'">+-*$( 2-2-"#$("6]FG 

"("#"#'+ #%

Ä Å(ËÇ#ÅÆÅÊÅÆ 'Ë ! 'Ë#"$Æ%

%&%

ÄÅÆÅ#ÇÈ(*)ÎÆÅ\ËÇ(Å*)ÎÆÅ+*)Î,ÊÅÆ

34- `Ä5

; ÆÅ(Ë/Ç#Å>=@?AB?ÏBzÉC(7AÎÆÅ#=@?

(7AÎ,ÊÅÆ=@?

+-A"(U*"#"#*92.$#"79:"(yb5"#"#0 b5 *?;"#2.6# b5"Z$H$# ,"

2.2 Multiset Resemblance

€ ?;"(2g68 F;X92-"5+-V!23+."\ +u%"%- X/+-A"#*"#2-"#,"#_$C

9:"(yb5"#"#4 "(2.+-/?;"(2g602."#!;602."fFG"{"#$(1§OˆV"\C

"(X+-/"(#%

2-"##%

“&"\@Kä9:">423+."\ [6]2-"(*¼

³k O9:"6M í NK0ƒ¿

÷]’Ó á‚â

’¼fƒ[ $#$# +-M23+-"(@KO%P˜z"($"HÓ á&â

’¼fƒ7+-"#2-"#$("6L+3FY ,2-1PC

™ žYrsut\ž’š(vTsuž’œq*¥0NK4ƒ[ FTKU%

¥0NK4ƒV¿VUjˆ['Œ÷HíNK4ƒ[¿K叿

Àß0æ9K »÷PußêRK4ƒV¿Kå&À

À¼*À

»åXWa Y

+.92."

¥0NK4ƒV¿]\_^

ígð

à:ŒM

NK4ƒV¿KåT

»AåXWa `

x P$# TA"(T+- ;2V,2-+-/ ‚+3"#,8",+3FY ,231O6 ,2-1

sužYr(m\s!rž.Ÿq;psuvTtm\%

“&+-A"k"\z+-/"## T2-+-"(i+-/"#5'"X,>92." QC +-/

Û NKªÃXbƒV¿ ,+-&—Û íNK4ƒ\êÛ í Nbƒƒ

÷HíNKªÃXbƒV¿ ÷4íNK4ƒ ÛíNK0ƒBÛ'íNbƒ

¿ ÷ í Nbƒ Û NK0ƒBÛ íNbƒ

¿ ÷4íNK4ƒÑL÷HíNbƒ ÛíNK0ƒV¿NÛ'íNbƒ

Ž‘'Gƒ\

F$KU%

"%Z+-


¥HNKSÀbƒV¿ U

à:ŒM íNK0ƒi¿Kå.gjà:Œé NK0ƒV¿ắ í Nbƒy

à:Œé NK0ƒV¿ắ í Nbƒy

»AåhWN  Y

I|"*$P$# ,"

žYrsutž’š(vTsuž’œq

my›m\q

Ib ’6MT+-$#"0A"(AƒZ91L$# ,+-/

¥0NK4ƒ8;6

¥0NKPÀbƒ\ z"%

+-/@iHjf"(Q "($%

23+."\k+-/;'"##%z“&"\»

ŒMªNKsmpotbƒy;¿ ^

íGð

M]íNK0ƒRM]íNbƒ à:ŒéíNK0ƒ[¿NéíNbƒy

íGð

à:ŒéíNK0ƒ[¿NéíNbƒy

Ksmpotb0À¿

ŒMhNKsmpotbƒy

 jĐ ỵ

À¼ÀĐNÀ½0Àƒ

¥0’¼fmpotbƒi¿

à:ŒM]íNK4ƒRM]íNbƒi¿Kågzà:ŒéíNK0ƒV¿ắíNbƒy

à:Œé NK0ƒV¿ắ í Nbƒy

»å_Wa  Y

/%-

Ä5 jÄ Å(ËÇ#Å.ỈÅÊÅỈHË: y 'Ë#"$Ỉz

v Ê-Ì{VÌÈR%

+-9+- H F_?;"(2g6

%&%'U+-/

ÄÅỈÅ#ÇÈ

34-

:wxBÄ5 iÄ5

; Å23gÅ<ỈÅ\ËÇ(Å<=s? Å<=s? ÅỈ~=s?E$?

:34-.X€

Ê3Ì{VÌÈ

2.3 Substring Resemblance

9"#&>+-+-"#6f9T1W6+3–"("(T /A+-<+- [ Fg"#f"#C

"("#86d+-/06+3–"("(TWFY >'# _9f+-DÙy"(T;2-2-1O+-,+3C

2.'Ú>b5#1#%5„ X"(,2-" 69;"@

“‚Q‰f," ;„_+3QC

yb5 ?"#2.6# ‡‚prs„ƒ7po>m@66ˆjžYtrs„ƒZpo,m\% €

2./"4FY >$# TA"#+-"#Q /"O;6M>+-2.+- Nu"%

"ŽT ³²«T - '‘Ž4: +-92-">«,9+3 € ^•zxxXC—/,ƒ\%

,>1> F_+-%

2.3.1 Q-gram Signature

€ Q#¯GŸtpo F‚Z"(@¼K @Z23+."\(KU%

{C—/ +-/;"W+-5$# ,"6

"(%

{AC—/Ð"("#92g$#"W F[ bz "(X¼`;60½ +-#»

¾X¿

ÀŠ!‹GŒf¼FMh’¼WƒÁhŠ!‹GŒf¼GMª’½*ƒ#À

ÀŠ!‹GŒf¼FMh’¼WƒÃhŠ!‹GŒf¼GMª’½*ƒ#À
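The q-gram resemblance of two fields is the set resemblance applied to the q-grams (length-q substrings) of their values rather than to the values themselves, so fields whose values differ only slightly can still score highly. The following is a minimal sketch; the helper names, the default q = 3, and the treatment of strings shorter than q (kept whole) are assumptions made for illustration.

```python
def qgrams(value, q=3):
    """Set of length-q substrings of a value; shorter non-empty strings are kept whole."""
    s = str(value)
    if not s:
        return set()
    if len(s) < q:
        return {s}  # assumption: a short string acts as its own single gram
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def field_qgrams(values, q=3):
    """Union of the q-gram sets of every value in a field."""
    grams = set()
    for v in values:
        grams |= qgrams(v, q)
    return grams


def qgram_resemblance(field_a, field_b, q=3):
    """|Q(A) ∩ Q(B)| / |Q(A) ∪ Q(B)|, where Q(.) is the q-gram set of a field."""
    ga, gb = field_qgrams(field_a, q), field_qgrams(field_b, q)
    union = ga | gb
    if not union:
        return 0.0
    return len(ga & gb) / len(union)


# The two fields share no exact value, but their q-gram sets overlap.
print(qgram_resemblance(["abcd"], ["bcde"]))
```

Because the q-gram sets are ordinary sets, the same signature technique used for exact values could be applied to them without change.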

60+-X"#Q+->"6d9T1

íGð âđị đ à:ŒÛíQNŠ!‹GŒf¼FMh’¼Wƒƒ[¿aÛíNŠ!‹GŒf¼GMª’½*ƒƒyYÂ\è

^+-$#"8bz"$# ,"yŠ!‹GŒf¼FMh’¼Wƒ@9:"\FY "*$# ,+-/>+3W"\

+-/" bz"!$#>Q "À

+-/"%

"#$(+- U F_yb5 ,{AC—/~"(@9T1

ÀŠ!‹GŒf¼FMh’¼Wƒ:Á9Š!‹GŒf¼GMª’½*ƒ#À¿

 zĐEÿ¾

ÀŠ!‹GŒf¼FMh’¼Wƒ#À(ĐNÀŠy‹GŒW¼GMª’½ƒ#Àƒ

42./"d{C—/ "#"(92.$#">'",2.+-A"(2-1O H9:">d"#2."6]91S

"#$2-2zFG  ^T"#$(+- MŽ%- 4H$( 2-"d F@"(FG25 :"(+-"#8 F@"\

+-/"(#%

{C—/~;2- /W F_¾

Þü\ý

6d¾ ý‚üÞ

Þ&üý

ígð

à:ŒÛ NŠ!‹GŒf¼GMª’¼fƒƒ:BÛ íNŠy‹GŒW¼GMª’½ƒƒyGÂ(è

ÿ¾

ý‚üÞ

ígð

à:ŒÛ NŠ!‹GŒf¼GMª’¼fƒƒ:BÛ íNŠy‹GŒW¼GMª’½ƒƒyGÂ(è

b@+.2-2&9:"72g'/"Z;6Sÿ¾

Þüý b@+-2-2_9:"f>2.2u%

,+.—é큒¼Wƒ\êé큒½*ƒƒ\êQơ_¿` ê#ëë#ë(êèP%i^T: "5b5"@'"X/+-A"#

"\z'"!$# +-"6,+-

?;"#2.6|½~†{AC—/ä"\%UxyFBŠ!‹GŒf¼FMh’¼Wƒ_ÃZŠ!‹GŒf¼FMhŽ7ƒ

$# 'A"(>P+-/+3?;$2-1E2./"(>FG$\+ = F<Šy‹GŒW¼GMª’½ƒ


^"($(+- OŽ%. %

2.4 Q-gram Sketches

 E"BrLAm\s—w±m\r]ŒŽ¸’ 8E6+-,"#+- ;2-231N"6$#"6K"#"("#C

+- D FLA"#$\ %~“"(‘¨9:"Pe’=6+.,"(+- 27A"#$\  W;6

2-"(X“

é[偐*ƒV¿Ø}˜>“

êë#ë#ë#êO}˜>“•”Aƒ

6+-Q$#"49:"(yb5"#"(EA"($( X

;6C



ê/



êO

ƒi¿ œ

íGð

—é[偐

ƒ(Œôu;ú|é[偐

ƒ(ŒôuGƒ

Âå

žYrs—pq:wm\%

"#$( z F_7?"#2.6d+-kZ >2-+-<#"6,$# 

+-4?"#2.64¼Z»

Š!,’¼fƒ(Œôu:¿

÷]p‰íQêOŠ!‹GŒf¼FMh’¼Wƒƒ

÷]p‰íQêxŠ!‹GŒf¼GMª’¼fƒƒ

žYrs—pq:wm*+-

/%

ŠtH’¼8ê½*ƒV¿§¦

NŠ!’¼fƒ(Œôuú¤Šy,’½*ƒ(ŒôuGƒ

A"($(C

åXWB¬Â2« j

NŠ!>’¼

ƒ\êOŠ!’¼

ƒƒ_+-j+-,N«#êO«ƒ_$# ?:6"#$("

+-"(2& ;6h™

NŠ!’¼

ƒ\êOŠ!>’¼

ƒƒ\%

{AC—/

?"#2.6

?"#2.6H2-"##%

2-+-A">{AC—/ç+-/;"## V+3+-86+3‹,$(23* H6"("(,+-","(

,>92." 9:"(+./2-+-"'!$( 9+-;+- #%

¬ Å(ËÇ#Å.ÆÅÊÅÆTÄ®8 !Ä®/"$TÆ%

A"#$\ X6+-Q$#"WFg J8;+-$#2.

?"#2.6 µ %&*%

{"f  \ƒ\»

ÄÅÆÅ#ÇÈ(*)ÎÆÅ\ËÇ(Å*)ÎÆÅ+*)Î,ÊÅÆ

Å3GÅ<(7AÎÆÅ(ËÇ#Å#=@? ÆÅ>=@?DB?ÏBzÉ

2.5 Finding Keys

/%j 6"("\C

,+-">+->'1|A"(1TC—+->'1P"(1| +->'1|A"(1TC’FY "#+-/MA"(1

+-z9,"6 91

+-,2-"#,"#'+ #»

I|"'",+-T"\"#Q"6O 231S+-]A"\1#  82-2VFY$(+- 2i6"\C

:"#;6"#$#+-"##%

, "7//"(+."*+-/$# ;6+3+- %

?;;6+-/Z $#"(#%5©W5 FG b5" ˜z"#2-2-> +.k+-"#6"6

+.6+-92-"  @+.Q"6Wbz"j9,+3T{-ÌÈVÉ5ÊË ÈÊ-ÌÇÈ:{"(+-"#
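Key detection can be illustrated with an exact, in-memory check: a set of fields is treated as a key when the number of distinct value tuples equals the number of rows, which is the same test a count-distinct query performs. This is a hedged sketch only. The names `is_key` and `find_minimal_keys`, the row representation (a list of dicts), and the `max_size` cutoff are assumptions for illustration and say nothing about how the paper's system evaluates candidate keys at scale.

```python
from itertools import combinations


def is_key(rows, fields):
    """True if no two rows share the same tuple of values on `fields`."""
    seen = set()
    for row in rows:
        key = tuple(row[f] for f in fields)
        if key in seen:
            return False
        seen.add(key)
    return True


def find_minimal_keys(rows, fields, max_size=2):
    """Enumerate field sets of up to `max_size` fields, keeping only minimal keys."""
    keys = []
    for size in range(1, max_size + 1):
        for combo in combinations(fields, size):
            # Skip supersets of a key already found: they are keys, but not minimal.
            if any(set(k) <= set(combo) for k in keys):
                continue
            if is_key(rows, combo):
                keys.append(combo)
    return keys


rows = [
    {"id": 1, "a": "x", "b": 1},
    {"id": 2, "a": "x", "b": 2},
    {"id": 3, "a": "y", "b": 1},
]
print(find_minimal_keys(rows, ["id", "a", "b"]))  # [('id',), ('a', 'b')]
```

The number of candidate field sets grows quickly with `max_size`, which is why the cutoff is kept small in this sketch.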

3 MINING DATABASE STRUCTURES

"

sG±mWr(m\sjœLP(Am/SrHP(œt!sG±žYr@s—pAš#n.m\  F¶_±psiœsG±m\tVzm(n

r5±p2ŸmFŸpnvm\r sG±pskptm7ržGožun-ptWs—œ>sG±TžGrV5m(n

kœžGqds—œZsG±TžYrks—pAš#n.m¸  ©¹rksG±TžYrzm(n

p8wœoW›œržGsymfœLP@sp¶jœ*œt@o,œtm

5m(n

„_+.6+-/$# ,: +-"f?"#2.6#%

3.1 Finding Join Paths

º ; 

92-"##%

 %k„_+.6L2-2k;+3 F@?"#2.6_»N¿®¼½iêx¼+º½

+.X*?"#2.64+-0¦

º:¿¿B¦Z%

92-"#i¦

;6>¦íQ%

½¾Ã


?"#2.6> F7¦ í 6#b@BFG Ä» ½

F_¦

u9:ƒ>„_+-;6O2-2VA"(1F¨ ½à Fz92-"*¦ í

?;"(2g6k F[¦Z%

+- ]Ž%Žd'"2-+-$92-"%fx—FiA"(19¨

àÛ>ÅxÆ#Y¦@ë¼8êQ¦

ŽFÇȁY¦

¼8ê ¦Xë¼Wƒi¿

àÛ>Å/Æ(Y¦Xë¼ZêQ¦ º ¼Wƒ

íGð

M í Y¦ º ¼fƒ à:ŒÛ Y¦Xë¼Wƒi¿NÛ íY¦ º ¼fƒy

ígð

à:ŒÛ Y¦@ë¼fƒV¿NÛ íY¦

¼fƒy

½>%

º 91

,+-uàÛ>Å/Æ#Y¦@ë¼8êy¦

¼fƒ àÛ>ÅxÆ#Y¦@ë½>êy¦

½ƒ\ê

ŽFÇȁY¦

¼8ê ¦Xë¼Wƒ\êxŽFÇȁY¦

½>ê¦@ë½*ƒ

3.2 Finding Composite Fields

"%-  "!?;"#2.60+-XdsutQpqr±P#œt\opsuž’œq> F

",?;"(2g6| OC

/%j

$(Q ,"\!+.6"#+-?"(G…g ³²«†9:"#$# ,"(G…3•! #³²A«'.†ƒ\%

2-"*$(Q ,"\#† ef,+.6 €

ef,+.6†6h…

¦WŽ%

 %

+-A"#4?"#2.66¼&

+-4?;"(2g60¤%

T%@ˆ[+3+- ]¤ +- 4¤ ½2À ê(ë#ë#ë#ê¤ ½Á ½ ÃW$# C

6#b@PFg  92-"

¦ %

‘T% ¶

¶ "#$2-2Fg 

+-W2./"#231O,9"(W F¼

914+./4ÿ¾.É

ÉLÊA%

,+-+-K2-"#A"#2A F;{AC—/a"("#92g$#"V"{+3"6WFG V6,+-+- 

+- 4 F‚yb5 *?"#2.6k+-@"+-231d$# ,"6,FG 

¤H½¾Ã'ÀË

*, QZ"%

+-<#">b5 2.6M9:" ibz"dMO{T"(1S S$( 2-2-"#$(>2.2z;+3* FX?;"#2.6

Þüý 

ý&üÞ

+-<#"8 Fi¤ ½¾Ã +.W% ½¾Ã $# +-8 #¬"#2-"#,"## :;6

3.3 Finding Heterogeneous Tables

“‚'/"* T6$(+- S6'9"#7 FG"#O9:"#$# ,"6+- 6"\"6S9:"\C

$"!f"(bB6;+.$(+."6"(T+3y1*QV9:"@, T6"#2-"6Hu"%/%

7"(bB"(T+-$#"W –"(+-/ 7"(b=$#Q ,"(j 1:" T7"(bK$#$# C +-/f $#"6" "($%

"\bh: +-/>92-"#Z;6S6+-,"#+ S92-"##% € Fg"(Z"#A"(2

7xy"("("(T+-$#" ''/"("6Z'_$# ,"\#% € 69"i+-_6"\C

+-/"6d *, T6"#2:+-;6+-+.6;2$# ,"(5 48?"#6C—+-$#"X2.%

>"(%

2.&% €

"( %

92-"%

 %

+."#092-"f¦Z


u9:ƒ>ˆ[+3+- ¤»]+- H2»½2Àê#ëë#ë#êx»½ Á

½ Ã#

+u%X©Wz2-2:>'+->29"(X¼

¼XÁ!¼ í Áy¼ ª À+-5>2-2’  <ÏEơ*¤Ð~ÏR÷P%[;6

¼ZÁX¼'íÁX¼xªÀ$09:"f6 "W+./8+.,2-"

í ;6

¼xªT +u%"%i9T1d$# ,+./

âđị đ

à:ŒÛ ¼:ƒi¿aÛ ” ¼ íƒV¿DÛ ” ¼ª ƒyGÂ(è

4 BELLMAN

I]"Z"86"(A"#2- +./dlWm(nYnopq ,6'9"Z9 b@"(XFY W$# C

€ 2-2_?;"(2g6,u"%/%- _T,"(+-$

"\ 1d F_  2-*’+-2.92-"Z

xjb@+-;6 b@X$#2-2.92-"Z91d9 

 ?2."(W+-OU+-"($\+."Zb5#1%@„ !"(,2-" ; "7  2[2-2 b@

I]"+-,2-"#,"#"6U˜j"#2-2.>]+- ´ '0+-/ ´ ˜k•D H$#$#"#

P©!$#2-"69;"%Z˜z"#$"˜z"#2-2->O+-f+-T"(;6"#64 d9:"*

´ >6 ´ ˜k•a"ZFu'W  d2 bØ;6H9//1

“|Œ

“U+-58•kĐ7Đ

˜k•

;6S©7•zx\%I|"Z 9+-"6U/  T6S:"(FG >$#"Z+-/d˜z"#2-2->

5 EXPERIMENTS

6"($(+-9:"#k"#A"(2:"#$(@ F[*2.'/"76*"( bz T+-/*"(T+.$("%

AŽS?;"(2g6+-

92-"#;6|T

 ?;2-"#> B2-2!?"#2.6,$# +-+-/|d2-"Q4޳]6+-Q+-$(d2-"#

Q 

"(:"#+." +-/Xk2./"z9:"( F;Ç-ÌÈ5ÊË ÈÊ3ÌÇ\È{"(+-"## ;6

5.1 Estimating Field Intersection Size

 bz 8?"#2.6#%i„ k7"#Q@67"( Tb5"!$# 2-2-"#$("6d¸¬76  ;+3

+-"("#$(+- 4+.<("%

"% Ž‘ÌØ j2-"#ƒ\%i”[$("#iFY 

2-"##%

"("#$\+ 0+-<#"(zFY @2-2,2."(#%V“‚'/"(@,2-"@+.<("#k+-, 'A"

"#2-;'+- 4+-,"f$# Q%

?;"(2g6,"{+3"6>9: k²³*"($# 6k+-/Ž‘³C—,2-"f+-/"( Ø 

"#$# ;6X+-/>‘³C—,2."7+-/;"#X 4 ŽA², ?;2-"6>?;"#2.6#%

5.2 Estimating Join Sizes

2-"#+-8z?;"#2.6% €


[Figure: Error in intersection size estimation, 50 samples (x-axis: resemblance).]

[Figure: Error in intersection size estimation, 100 samples (x-axis: resemblance).]

$(;2&"#"(92.$#"%

u"%/%,'Z2."#Q*Ž³Ìƒ\%

 ³³

[Figure: Error in join size estimation, 100 samples (x-axis: resemblance).]

[Figure: Error in join size estimation, 250 samples (x-axis: resemblance).]


[Figure: Unadjusted join size vs. actual join size, 100 samples (x-axis: actual join size).]

,2-"#@2-!H66+3+- 2i ³, QC—Fg"{"#k,2-"##%

5.3 Q-gram Signatures

/ +-/"##%=I|"d6 ,231M"#2-"#$\"6B¸«P;+-, Ff?;"(2g6

"("#92g$#"* F@'Z2-"Q> '‘2Ìd%,I|"*"#Q+->"6

"("#92g$#"4+-/]{C—/

/Š+-/;'"#!6P '‘³C—,2-"*{AC—/J+./;'"## ;"(:"#$(C

$($#'"#231P"(Q+.>'+-/U{AC—/

[Figure: Adjusted join size vs. actual join size, 100 samples (x-axis: actual join size).]

[Figure: Estimated vs. actual q-gram resemblance, 50 samples (x-axis: actual resemblance).]

60>2.2"#"#*92.$#"f+./,‘³C—,2-"7{ACy/Ð+-/;"%

5.4 Q-gram Sketches

+-/","(:"(+-,"#%*I]""#Q+->"

6+-Q$#"%

A"#$( X6+-Q$#"f+-H„_+-/"7²8FG !‘³C—,2."

"#Q+->"(# i;6P+.M„_+-/"O ³4FY 4 ‘³C—,2-">"#Q+->"##%

23C

A"#$\ Z6+.Q$#"6P{AC—/Š"#"(C


[Figure: Estimated vs. actual q-gram resemblance, 150 samples (x-axis: actual resemblance).]

[Figure: Estimated vs. actual q-gram vector distance, 50 sketch samples (x-axis: actual q-gram vector distance).]

[Figure: Estimated vs. actual q-gram vector distance, 150 sketch samples (x-axis: actual q-gram vector distance).]

[Figure: Q-gram vector distance vs. q-gram resemblance (x-axis: q-gram resemblance).]

+- H9:"\ b5"("#4"#"#*92.$#"7;646+-Q$("%

5.5 Qualitative Experiments

"(2- +- H $#"#XQ+-2.2&"#"#65 9:"ZQb5"\"6%

"!6'  7bz"X$# 56+-$#j1;+-$#2.k{"(1*"(23#%

  2-#%

5.5.1 Using Multiset Resemblance

 >/+-A"(H?;"#2.6% }

˜z"#2-2->;ƒ\%V˜z"#2-2->†V9+-2-+3 18 f{+-$T2-17;68+-T"\$(+-A"#231Z?;6

?;"(2g6#%

+-7$#$#"(9231O2 b7% €


Date posted: 19/02/2014, 12:20
