Mining Database Structure; Or, How to Build a Data Quality Browser
Tamraparni Dasu, Theodore Johnson, S. Muthukrishnan, Vladislav Shkapenyuk
AT&T Labs–Research
ABSTRACT
2-"#/$(1aUFY"6"("6D6'R$#"#UFGSM"(bc:" 79
9:!+->1dA"(1k@FG"#+-/U"(1#%
I]"]'"L6"(A"#2-+./BMQ1Q"#H 7lWm(nYn
+-U$H9:"7"64"%/% ;"#"Z6'FYf6,+-+-/ ?;6
1 INTRODUCTION
7+-"(y1F&$"#5+-$#2-;6+-/7+.$#"($(231>"#"("6,6' T+-$(C
;64"(T+3+-"#8u"%/%- :"(ba"(T+.$("#!W$(Q,"\k1:"#\%
6,+-C
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
ACM SIGMOD ’2002 June 4-6, Madison, Wisconsin, USA
Copyright 2002 ACM 1-58113-497-5/02/06 $5.00.
$#"#$(231H,T6"#2-+./>$#Q,"(!6H"#"(+-/,2.2+.FY>+-
+-$>$#Q%NV2g+./ @;2-1+-# X6B6P,+-+./]'"HFGQC
"{+3"W$#$#"(5**23+-2-"f69"#@'"f6+-$#23 "(:"#+-A"
+-/d ' ;5b5"W6+-$#k+-d,"
[T6$\+ 869;"#FG"#Z$#T+-8>1f66+3+-2A9C
6+3,$(23%jzQ$(+-/d"#+31Uu"%/% Z$#:"f$#Q,"(
,"(T'+-]+-82-231|;'"d;6PCFGCy6'"%U"#+-/MA"(1 6"(:"#;6"#$#+-"#f";2-231S7>+-+-"#6P;6O>1U6"#/6" 'A"(+-,"%EzTA"("#231 i92-"#,>#1L$(+-R;6$#,"(T"#6
FY"#+-/EA"(1#% +u%
6+3"("(Tz92-"#7u"%
:"#6NlWm(nYnopq 5|6'|{;2-+31R9'b@"\%`j"#2-2.>='T+.6"#
psppr#mf;tQQjn
92-"#;$#"## @92."(# W;6E?"#2.6#%
?;2-"#8u"%/%- ;T9:"(kF[b@!+-U92-" ;T9:"(kFV6+-Q+-$\
2-"#k+-08?"#2.6 "($%
"60+.4"#$#6#%
6'9"%
¢M£
"#kF[¥d%
¢M£
"(/"#"#§
"($#6#%
>+./*2.#%
1.1 Related Work
f92-" A «T ¬%z4"X/"#"(2-231 "X$,:"(FY®p\:¯
BQ+-/|6+-Q$#"|A« @²T !³%a©W,+-"("#Q>+->+.M?;6+-/
+-&%
!´A!µ - (;6 X¶!·
©7^0¸'"!{"(1Q1Q"#,j6"(+./"6
$(+g2-2-1>+-"#2-+-/*,+-/Z;6Q$\2$#¹;+-$(XT «T T
"($(i6'9;"5Q$\"f
+-/,H"(T+3"Z69"f,2-2 b`6',+-+-/{T"(+-"##%
92-"#@
G\ &66+3+-;2_+-FG>+-
9:!$("#$(+-@9:"(bz"#"#4?"#2.6@+-X"#"6"6%
/,# ,2."(iVbkA"#2-"(j$#"(,$(+."(TX3 ¸'%[I]"k"k"#"(2
+-<#"O² i;6P"(2."($(+-+3y1]"#Q+->'+ K
'z;6UQ",+-/>bkA"#2-"(3 #²%
?;"#2.6@+./>2-2,>'+."(#%
$#$#$(1%
6'9;"%
2 SUMMARIZING VALUES OF A FIELD
6'9"X+.>f>2-2;,VF$#"%j©Wj6f,+-+./Z2-/C
2.1 Set Resemblance
$\rho = |A \cap B| / |A \cup B|$.
"%
Å#ÇÈZÉ5ÊËÈÊ-ÌÇÈ7Í[ỴÏ@%7f:"#Wb5"
$|A \cap B| = \frac{\rho}{1 + \rho}\,(|A| + |B|)$
?;"(2g6 HÀ¼*À;6PÀ½dÀT+.!23b51X+-2g92-"%
T*9:"([ÕD%i"(WÛ¼WV¿K,+-ÜÝÞzĨuß\%
$\Pr[h(A) = h(B)] = \rho$
ÛA¼f[¿aÛA½y+-_kz"(2.2-+
,"# '"#"X¸ 9&b5"i/+-A"5@{+-$f+-+3+-A"V"(2.+-%izC
¼8 ;,"#231Z"#2-"#$(jĨ&á&â³A+3F:+3V+-[+-¼8 "#2-"kĨ&á&â'Q '\ "#2-"kĨ&áâ'\
6f&%
$#+.6"(8Ĩ áâ +-B¼aÃ]½,%
+-;6+.$#"64914ÛA¼f[¿NÛA½\%
7+3z,"&%
,2-"##% }
¼W\ê(ë#ëë(êÛ'ìZ¼W\ ÜÝÞ Ĩ;íuß\%_I|"7"#Q+->'"W¾>91
$\hat{\rho} = \sum_{i=1}^{N} I[h_i(A) = h_i(B)] / N$
í $(2.6L9:"d;+3b@+-"
+-6"#:"#6"#%
xy|">FX+-/;'"## Vbz"d>A"bzU,T6+3?;$#+-8FG
³ê\ù(âzúE (FGW$#TA"#+-"#@+."#/"(
u"%/%8231S,FY"(bh,+-2-2-+-P6+-Q+-$(72-"#*¸%*fA"(1U2./"
³Têûù úR (_>+-2."609T102-/
²;/"
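T%
resemblance, it can also estimate
$\rho_{A \setminus B} = |A - B| / |A \cup B|$ and
$\rho_{B \setminus A} = |B - A| / |A \cup B|$ by
$\hat{\rho}_{A \setminus B} = \sum_{i=1}^{N} I[h_i(A) < h_i(B)] / N$
$\hat{\rho}_{B \setminus A} = \sum_{i=1}^{N} I[h_i(A) > h_i(B)] / N$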
xF ¾76ÿ¾î ýüÞ "j2./" 9Vÿ¾ Þ&üý +->2-2u b5"i$8$#$#2-;6"
+-
$s(A \cup B) = (\min(h_1(A), h_1(B)), \ldots, \min(h_N(A), h_N(B)))$
69;"%
+./H?;2-+-/ VH"(8+-/;'">+-*$(2-2-"#$("6]FG
"("#"#'+ #%
Ä Å(ËÇ#ÅÆÅÊÅÆ'Ë !'Ë#"$Æ%
%&%
ÄÅÆÅ#ÇÈ(*)ÎÆÅ\ËÇ(Å*)ÎÆÅ+*)Î,ÊÅÆ
34- `Ä5
; ÆÅ(Ë/Ç#Å>=@?AB?ÏBzÉC(7AÎÆÅ#=@?
(7AÎ,ÊÅÆ=@?
+-A"(U*"#"#*92.$#"79:"(yb5"#"#0b5*?;"#2.6# b5"Z$H$#,"
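The set resemblance machinery of this section can be illustrated with a minimal sketch. It assumes that resemblance is the ratio $|A \cap B| / |A \cup B|$, that a signature holds the minimum hash value of the field under each of N hash functions, and that the intersection size follows from the estimate as $\rho/(1+\rho)\,(|A|+|B|)$. The function names, the use of BLAKE2 as the hash family, and the default N are illustrative assumptions, not taken from the paper.

```python
import hashlib


def _hash(value, seed):
    """Deterministic 64-bit hash of a value under the seed-th hash function (assumed hash family)."""
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def signature(values, n=200):
    """Min-hash signature: the minimum hash of the distinct values under each of n hash functions."""
    distinct = set(values)
    return [min(_hash(v, i) for v in distinct) for i in range(n)]


def estimate_resemblance(sig_a, sig_b):
    """Fraction of signature positions on which the two fields agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def estimate_intersection(rho, size_a, size_b):
    """Intersection size implied by a resemblance rho and the two distinct-value counts."""
    return rho / (1.0 + rho) * (size_a + size_b)


def union_signature(sig_a, sig_b):
    """Signature of the union of two fields: position-wise minimum of their signatures."""
    return [min(x, y) for x, y in zip(sig_a, sig_b)]


if __name__ == "__main__":
    a = range(0, 400)
    b = range(200, 600)
    sig_a, sig_b = signature(a), signature(b)
    rho = estimate_resemblance(sig_a, sig_b)
    print("estimated resemblance:", rho)
    print("estimated intersection size:", estimate_intersection(rho, 400, 400))
```

Because each signature is computed from one field alone, signatures can be stored once per field and compared pairwise later without rereading the tables.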
2.2 Multiset Resemblance
?;"(2g68F;X92-"5+-V!23+."\ +u%"%- X/+-A"#*"#2-"#,"#_$C
9:"(yb5"#"#4"(2.+-/?;"(2g602."#!;602."fFG"{"#$(1§OV"\C
"(X+-/"(#%
2-"##%
&"\@Kä9:">423+."\ [6]2-"(*¼
³kO9:"6M í NK0¿
÷]Ó áâ
¼f[$#$# +-M23+-"(@KO%Pz"($"HÓ á&â
¼f7+-"#2-"#$("6L+3FY,2-1PC
Yrsut\(vTsuq*¥0NK4[FTKU%
¥0NK4V¿VUj['÷HíNK4[¿Kå¿
Àß0æ9K »÷PußêRK4V¿Kå&À
À¼*À
ȌXWa Y
+.92."
¥0NK4V¿]\_^
ígð
à:M
NK4V¿KåT
»AåXWa `
xP$#TA"(T+-;2V,2-+-/ +3"#,8",+3FY,231O6,2-1
suYr(m\s!r.q;psuvTtm\%
&+-A"k"\z+-/"## T2-+-"(i+-/"#5'"X,>92." QC +-/
Û NKªÃXbV¿ ,+-&Û íNK4\êÛ í Nb
÷HíNKªÃXbV¿ ÷4íNK4 ÛíNK0BÛ'íNb
¿ ÷ í Nb Û NK0BÛ íNb
¿ ÷4íNK4ÑL÷HíNb ÛíNK0V¿NÛ'íNb
'G\
F$KU%
"%Z+-
¥HNKSÀbV¿ U
à:M íNK0i¿Kå.gjà:é NK0V¿ắ í Nby
à:é NK0V¿ắ í Nby
»AåhWN Y
I|"*$P$#,"
Yrsut(vTsuq
mym\q
Ib 6MT+-$#"0A"(AZ91L$#,+-/
¥0NK48;6
¥0NKPÀb\ z"%
+-/@iHjf"(Q "($%
23+."\k+-/;'"##%z&"\»
MªNKsmpotby;¿ ^
íGð
M]íNK0RM]íNbà:éíNK0[¿NéíNby
íGð
à:éíNK0[¿NéíNby
Ksmpotb0À¿
MhNKsmpotby
jĐ ỵ
À¼ÀĐNÀ½0À
¥0¼fmpotbi¿
à:M]íNK4RM]íNbi¿Kågzà:éíNK0V¿ắíNby
à:é NK0V¿ắ í Nby
Ȍ_Wa Y
/%-
Ä5jÄ Å(ËÇ#Å.ỈÅÊÅỈHË: y'Ë#"$Ỉz
v Ê-Ì{VÌÈR%
+-9+-HF_?;"(2g6
%&%'U+-/
ÄÅỈÅ#ÇÈ
34-
:wxBÄ5iÄ5
; Å23gÅ<ỈÅ\ËÇ(Å<=s? Å<=s? ÅỈ~=s?E$?
:34-.X
Ê3Ì{VÌÈ
2.3 Substring Resemblance
9"#&>+-+-"#6f9T1W6+3"("(T/A+-<+-[Fg"#f"#C
"("#86d+-/06+3"("(TWFY>'# _9f+-DÙy"(T;2-2-1O+-,+3C
2.'Ú>b5#1#%5X"(,2-" 69;"@
Qf," ;_+3QC
yb5?"#2.6# prs7po>m@66jYtrsZpo,m\%
2./"4FY>$#TA"#+-"#Q/"O;6M>+-2.+-Nu"%
"T ³²«T - '4:+-92-">«,9+3 ^zxxXC/,\%
,>1>F_+-%
2.3.1 Q-gram Signature
Q#¯Gtpo FZ"(@¼K@Z23+."\(KU%
{C/ +-/;"W+-5$#,"6
"(%
The q-gram resemblance of two sets A and B is:
$\rho = |QGRAM(A) \cap QGRAM(B)| / |QGRAM(A) \cup QGRAM(B)|$
and is estimated by
$\hat{\rho} = \sum_{i=1}^{N} I[h_i(QGRAM(A)) = h_i(QGRAM(B))] / N$
^+-$#"8bz"$#,"y!Gf¼FMh¼W@9:"\FY"*$#,+-/>+3W"\
+-/" bz"!$#>Q"À
+-/"%
"#$(+-UF_yb5,{AC/~"(@9T1
$|QGRAM(A) \cap QGRAM(B)| = \frac{\hat{\rho}}{1 + \hat{\rho}}\,(|QGRAM(A)| + |QGRAM(B)|)$
42./"d{C/ "#"(92.$#">'",2.+-A"(2-1OH9:">d"#2."6]91S
"#$2-2zFG ^T"#$(+-M%- 4H$(2-"dF@"(FG25:"(+-"#8F@"\
+-/"(#%
{C/~;2-/WF_¾
Þü\ý
6d¾ ýüÞ
Þ&üý
ígð
à:Û N!Gf¼GMª¼f:BÛ íNyGW¼GMª½yGÂ(è
ÿ¾
ýüÞ
ígð
à:Û N!Gf¼GMª¼f:BÛ íNyGW¼GMª½yGÂ(è
b@+.2-2&9:"72g'/"Z;6Sÿ¾
Þüý b@+-2-2_9:"f>2.2u%
,+.éí¼W\êéí½*\êQơ_¿` ê#ëë#ë(êèP%i^T:"5b5"@'"X/+-A"#
"\z'"!$#+-"6,+-
?;"#2.6|½~{AC/ä"\%UxyFB!Gf¼FMh¼W_ÃZ!Gf¼FMh7
$#'A"(>P+-/+3?;$2-1E2./"(>FG$\+ =F<yGW¼GMª½
^"($(+-O%. %
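A q-gram signature can be sketched in the same way as a set signature, by first replacing each field with the set of q-grams that occur in its values. The sketch below assumes q = 3, treats values shorter than q as a single gram, and uses BLAKE2 as a stand-in hash family; these choices and all function names are illustrative assumptions rather than the paper's own definitions.

```python
import hashlib


def qgrams(value, q=3):
    """Set of length-q substrings of one value; short values contribute themselves (an assumption)."""
    s = str(value)
    if len(s) < q:
        return {s} if s else set()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_set(values, q=3):
    """Union of the q-gram sets of all values in a field."""
    grams = set()
    for v in values:
        grams |= qgrams(v, q)
    return grams


def exact_qgram_resemblance(values_a, values_b, q=3):
    """Exact ratio of shared q-grams to all q-grams of the two fields."""
    ga, gb = qgram_set(values_a, q), qgram_set(values_b, q)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def _hash(gram, seed):
    data = f"{seed}:{gram}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def qgram_signature(values, q=3, n=200):
    """Min-hash signature computed over the q-gram set of a non-empty field."""
    grams = qgram_set(values, q)
    return [min(_hash(g, i) for g in grams) for i in range(n)]


def estimate_qgram_resemblance(sig_a, sig_b):
    """Fraction of signature positions on which the two q-gram signatures agree."""
    return sum(1 for x, y in zip(sig_a, sig_b) if x == y) / len(sig_a)


if __name__ == "__main__":
    a = [f"user{i:04d}" for i in range(200)]
    b = [f"user{i:04d}@x" for i in range(200)]
    print("exact:", exact_qgram_resemblance(a, b))
    print("estimated:", estimate_qgram_resemblance(qgram_signature(a), qgram_signature(b)))
```

The two fields in the usage example share no exact values, so their set resemblance is zero, while their q-gram resemblance stays high; this is the kind of near-match that substring resemblance is meant to surface.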
2.4 Q-gram Sketches
E"BrLAm\sw±m\r]¸ 8E6+-,"#+-;2-231N"6$#"6K"#"("#C
+-DFLA"#$\%~"(¨9:"Pe=6+.,"(+-27A"#$\ W;6
2-"(X
é[å*V¿Ø}>
êë#ë#ë#êO}>A
6+-Q$#"49:"(yb5"#"(EA"($(X
;6C
ê/
êO
i¿
íGð
é[å
(ôu;ú|é[å
(ôuG
Âå
Yrspq:wm\%
"#$(zF_7?"#2.6d+-kZ>2-+-<#"6,$#
+-4?"#2.64¼Z»
!,¼f(ôu:¿
÷]píQêO!Gf¼FMh¼W
÷]píQêx!Gf¼GMª¼f
Yrspq:wm*+-
/%
tH¼8ê½*V¿§¦
N!¼f(ôuú¤y,½*(ôuG
A"($(C
åXWB¬Â2« j
N!>¼
\êO!¼
_+-j+-,N«#êO«_$#?:6"#$("
+-"(2&;6h
N!¼
\êO!>¼
\%
{AC/
?"#2.6
?"#2.6H2-"##%
2-+-A">{AC/ç+-/;"## V+3+-86+3,$(23*H6"("(,+-","(
,>92." 9:"(+./2-+-"'!$(9+-;+-#%
¬ Å(ËÇ#Å.ÆÅÊÅÆTÄ®8 !Ä®/"$TÆ%
A"#$\X6+-Q$#"WFgJ8;+-$#2.
?"#2.6 µ %&*%
{"f\\»
ÄÅÆÅ#ÇÈ(*)ÎÆÅ\ËÇ(Å*)ÎÆÅ+*)Î,ÊÅÆ
Å3GÅ<(7AÎÆÅ(ËÇ#Å#=@? ÆÅ>=@?DB?ÏBzÉ
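The idea of a q-gram sketch can be illustrated with a small random-projection example. It assumes that a field's q-gram vector holds the count of each q-gram, normalized to unit length, and that a sketch is the vector of dot products with k pseudo-random Gaussian vectors, so that the scaled distance between two sketches approximates the L2 distance between the original vectors. The Gaussian projection, the normalization, q = 3, and all names below are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def qgram_vector(values, q=3):
    """Sparse q-gram count vector of a field, normalized to unit L2 length (assumed normalization)."""
    counts = Counter()
    for v in values:
        s = str(v)
        for i in range(len(s) - q + 1):
            counts[s[i:i + q]] += 1
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {g: c / norm for g, c in counts.items()} if norm else {}


def _coeff(j, gram):
    """Gaussian coefficient of q-gram `gram` in the j-th random vector, derived from a seed
    so that every field is projected onto the same random vectors."""
    return random.Random(f"{j}:{gram}").gauss(0.0, 1.0)


def sketch(vector, k=200):
    """k-dimensional sketch: dot product of the q-gram vector with each random vector."""
    return [sum(x * _coeff(j, g) for g, x in vector.items()) for j in range(k)]


def sketch_distance(sk_a, sk_b):
    """Estimate of the L2 distance between the original vectors from their sketches."""
    k = len(sk_a)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(sk_a, sk_b)) / k)


def exact_distance(va, vb):
    """Exact L2 distance between two sparse vectors."""
    keys = set(va) | set(vb)
    return math.sqrt(sum((va.get(g, 0.0) - vb.get(g, 0.0)) ** 2 for g in keys))


if __name__ == "__main__":
    a = qgram_vector(f"cust{i:03d}" for i in range(100))
    b = qgram_vector(f"acct{i:03d}" for i in range(100))
    print("exact distance:", exact_distance(a, b))
    print("sketch estimate:", sketch_distance(sketch(a), sketch(b)))
```

Since the projection is linear, the difference of two sketches equals the sketch of the difference vector, which is why sketches built separately for each field can still be compared.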
2.5 Finding Keys
/%j6"("\C
,+-">+->'1|A"(1TC+->'1P"(1|+->'1|A"(1TCFY"#+-/MA"(1
+-z9,"6 91
+-,2-"#,"#'+ #»
I|"'",+-T"\"#Q"6O231S+-]A"\1# 82-2VFY$(+-2i6"\C
:"#;6"#$#+-"##%
,"7//"(+."*+-/$#;6+3+-%
?;;6+-/Z$#"(#%5©W5FGb5" z"#2-2-> +.k+-"#6"6
+.6+-92-" @+.Q"6Wbz"j9,+3T{-ÌÈVÉ5ÊËÈÊ-ÌÇÈ:{"(+-"#
3 MINING DATABASE STRUCTURES
"
sG±mWr(m\sjLP(Am/SrHP(t!sG±Yr@spA#n.m\ F¶_±psisG±m\tVzm(n
r5±p2mFpnvm\r sG±pskptm7rGoun-ptWs>sG±TGrV5m(n
kGqdsZsG±TYrkspA#n.m¸ ©¹rksG±TYrzm(n
p8woWrGsymfLP@sp¶j*t@o,tm
5m(n
_+.6+-/$#,:+-"f?"#2.6#%
3.1 Finding Join Paths
º ;
92-"##%
%k_+.6L2-2k;+3F@?"#2.6_»N¿®¼½iêx¼+º½
+.X*?"#2.64+-0¦
º:¿¿B¦Z%
92-"#i¦
;6>¦íQ%
½¾Ã
?"#2.6>F7¦ í 6#b@BFGÄ» ½
F_¦
u9:>_+-;6O2-2VA"(1F¨ ½Ã Fz92-"*¦ í
?;"(2g6kF[¦Z%
+-]%d'"2-+-$92-"%fxFiA"(19¨
àÛ>ÅxÆ#Y¦@ë¼8êQ¦
FÇÈY¦
¼8ê¦Xë¼Wi¿
àÛ>Å/Æ(Y¦Xë¼ZêQ¦ º ¼W
íGð
M í Y¦ º ¼fà:Û Y¦Xë¼Wi¿NÛ íY¦ º ¼fy
ígð
à:Û Y¦@ë¼fV¿NÛ íY¦
¼fy
½>%
º 91
,+-uàÛ>Å/Æ#Y¦@ë¼8êy¦
¼fàÛ>ÅxÆ#Y¦@ë½>êy¦
½\ê
FÇÈY¦
¼8ê¦Xë¼W\êxFÇÈY¦
½>ê¦@ë½*
3.2 Finding Composite Fields
"%- "!?;"#2.60+-XdsutQpqr±P#t\opsuq>F
",?;"(2g6|OC
/%j
$(Q,"\!+.6"#+-?"(G g ³²«9:"#$#,"(G 3! #³²A«'.\%
2-"*$(Q,"\# ef,+.6
ef,+.66h
¦W%
%
+-A"#4?"#2.66¼&
+-4?;"(2g60¤%
T%@[+3+-]¤ +-4¤ ½2À ê(ë#ë#ë#ê¤ ½Á ½ ÃW$#C
6#b@PFg 92-"
¦ %
T% ¶
¶ "#$2-2Fg
+-W2./"#231O,9"(WF¼
914+./4ÿ¾.É
ÉLÊA%
,+-+-K2-"#A"#2AF;{AC/a"("#92g$#"V"{+3"6WFGV6,+-+-
+-4Fyb5*?"#2.6k+-@"+-231d$#,"6,FG
¤H½¾Ã'ÀË
*,QZ"%
+-<#">b52.6M9:" ibz"dMO{T"(1SS$(2-2-"#$(>2.2z;+3*FX?;"#2.6
Þüý
ý&üÞ
+-<#"8Fi¤ ½¾Ã +.W% ½¾Ã $#+-8 #¬"#2-"#,"## :;6
3.3 Finding Heterogeneous Tables
'/"*T6$(+-S6'9"#7FG"#O9:"#$#,"6+-6"\"6S9:"\C
$"!f"(bB6;+.$(+."6"(T+3y1*QV9:"@,T6"#2-"6Hu"%/%
7"(bB"(T+-$#"W"(+-/ 7"(b=$#Q,"(j1:" T7"(bK$#$#C +-/f$#"6" "($%
"\bh:+-/>92-"#Z;6S6+-,"#+ S92-"##% Fg"(Z"#A"(2
7xy"("("(T+-$#" ''/"("6Z'_$#,"\#% 69"i+-_6"\C
+-/"6d*,T6"#2:+-;6+-+.6;2$#,"(548?"#6C+-$#"X2.%
>"(%
2.&%
"(%
92-"%
%
+."#092-"f¦Z
u9:>[+3+-¤»]+-H2»½2Àê#ëë#ë#êx»½ Á
½ Ã#
+u%X©Wz2-2:>'+->29"(X¼
¼XÁ!¼ í Áy¼ ª À+-5>2-2 <ÏEơ*¤Ð~ÏR÷P%[;6
¼ZÁX¼'íÁX¼xªÀ$09:"f6"W+./8+.,2-"
í ;6
¼xªT +u%"%i9T1d$#,+./
âđị đ
à:Û ¼:i¿aÛ ¼ íV¿DÛ ¼ª yGÂ(è
4 BELLMAN
I]"Z"86"(A"#2-+./dlWm(nYnopq ,6'9"Z9b@"(XFYW$#C
2-2_?;"(2g6,u"%/%- _T,"(+-$
"\1dF_2-*+-2.92-"Z
xjb@+-;6b@X$#2-2.92-"Z91d9
?2."(W+-OU+-"($\+."Zb5#1%@!"(,2-" ;"72[2-2 b@
I]"+-,2-"#,"#"6Uj"#2-2.>]+- ´ '0+-/ ´ kDH$#$#"#
P©!$#2-"69;"%Zz"#$"z"#2-2->O+-f+-T"(;6"#64d9:"*
´ >6 ´ ka"ZFu'Wd2 bØ;6H9//1
|
U+-58kĐ7Đ
k
;6S©7zx\%I|"Z9+-"6U/T6S:"(FG>$#"Z+-/dz"#2-2->
5 EXPERIMENTS
6"($(+-9:"#k"#A"(2:"#$(@F[*2.'/"76*"(bzT+-/*"(T+.$("%
AS?;"(2g6+-
92-"#;6|T
?;2-"#>B2-2!?"#2.6,$#+-+-/|d2-"Q4³]6+-Q+-$(d2-"#
Q
"(:"#+." +-/Xk2./"z9:"(F;Ç-ÌÈ5ÊËÈÊ3ÌÇ\È{"(+-"## ;6
5.1 Estimating Field Intersection Size
bz8?"#2.6#%ik7"#Q@67"( Tb5"!$#2-2-"#$("6d¸¬76 ;+3
+-"("#$(+-4+.<("%
"% ÌØj2-"#\%i[$("#iFY
2-"##%
"("#$\+ 0+-<#"(zFY@2-2,2."(#%V'/"(@,2-"@+.<("#k+-,'A"
"#2-;'+-4+-,"f$#Q%
?;"(2g6,"{+3"6>9:k²³*"($#6k+-/³C,2-"f+-/"( Ø
"#$#;6X+-/>³C,2."7+-/;"#X4 A²,?;2-"6>?;"#2.6#%
5.2 Estimating Join Sizes
2-"#+-8z?;"#2.6%
[Figure: Error in intersection size estimation, 50 samples and 100 samples. Axis label: Resemblance.]
$(;2&"#"(92.$#"%
u"%/%,'Z2."#Q*³Ì\%
³³
[Figure: Error in join size estimation, 100 samples and 250 samples. Axis label: Resemblance.]
[Figure: Unadjusted join size vs actual join size, 100 samples. Axis label: Actual join size.]
,2-"#@2-!H66+3+-2i ³,QCFg"{"#k,2-"##%
5.3 Q-gram Signatures
/ +-/"##%=I|"d6,231M"#2-"#$\"6B¸«P;+-,Ff?;"(2g6
"("#92g$#"*F@'Z2-"Q> '2Ìd%,I|"*"#Q+->"6
"("#92g$#"4+-/]{C/
/+-/;'"#!6P '³C,2-"*{AC/J+./;'"## ;"(:"#$(C
$($#'"#231P"(Q+.>'+-/U{AC/
[Figure: Adjusted join size vs actual join size, 100 samples. Axis label: Actual join size.]
[Figure: Estimated vs actual q-gram resemblance, 50 samples. Axis label: Actual resemblance.]
60>2.2"#"#*92.$#"f+./,³C,2-"7{ACy/Ð+-/;"%
5.4 Q-gram Sketches
+-/","(:"(+-,"#%*I]""#Q+->"
6+-Q$#"%
A"#$(X6+-Q$#"f+-H_+-/"7²8FG!³C,2."
"#Q+->"(# i;6P+.M_+-/"O ³4FY4 ³C,2-">"#Q+->"##%
23C
A"#$\Z6+.Q$#"6P{AC/"#"(C
[Figure: Estimated vs actual q-gram resemblance, 150 samples. Axis label: Actual resemblance.]
[Figure: Estimated vs actual q-gram vector distance, 50 sketch samples and 150 sketch samples. Axis label: Actual q-gram vector distance.]
[Figure: Q-gram vector distance vs q-gram resemblance. Axis label: Q-gram resemblance.]
+-H9:"\b5"("#4"#"#*92.$#"7;646+-Q$("%
5.5 Qualitative Experiments
"(2-+-H$#"#XQ+-2.2&"#"#659:"ZQb5"\"6%
"!6' 7bz"X$#56+-$#j1;+-$#2.k{"(1*"(23#%
2-#%
5.5.1 Using Multiset Resemblance
>/+-A"(H?;"#2.6% }
z"#2-2->;\%Vz"#2-2->V9+-2-+318f{+-$T2-17;68+-T"\$(+-A"#231Z?;6
?;"(2g6#%
+-7$#$#"(9231O2 b7%