Cho n6n, trong dcrn vi hgc trinh cira tdi, t6i da chgn cho minh ae tai *Xdy dpg c6ng cg tim kii5m tpp tin vdn bin theo nQi dung" ldm bii kh6a lufn ttit nghiQp... D6i v6i nhirng ai li tin
Trang 1TRTIoNG DAr Hec rAv o6
pfEcz xu6,r LUAN 16r NGHrpp
oB rAr: xAv Dqlyc cONc cq riur xmur
SVTH: t Avr rry cnAM pA
LOP DAI HQC TIN HQC 4
MSSV:0951010050
GVHD: Th.s NGUYTN Crri cUONc
cAN THo 2ot3L.003736I
Trang 2Loi cdm on
Sau b6n n6m tr6n giing dudng d4i hgc, di5n nay di s6p ra truong tdi thenh thAt
ti6p sirc t6i trong qu5 trinh hqc t?p ctng nhu trong c6c kj, thi
Trong qu6 trinh ldm lu4n vdn t6t nghiQp, t6i dA nh4n dugc sg hudng d6n cfing nhu
trg girip nhiQt tinh cta Thiy Nguy6n Chi Culng, li gi6o vi6n hucmg d6n ThAy d5 tichcyc trong qu6 trinh rdn luyQn k! ning l4p trinh cfing nhu ki6n thfc cho t6i D6n nay chuong trinh dd hoin thinh t6i xin gti loi cim on chdn thinh t16n Thiy
Ngodi ra, t6i xin gtii ldi cim crn siu sic d6n cha mg, ngudi d5 cho t6i budc chdn
trong cuQc s5ng Ngoii ra, t6i cffng xin gui loi cim crn thdn thi6t d6n t6t c6 bpn bd de quan tdm e6p ! d6n bni lu{n v6n cf,ng nhu de girip dd t6i
Cutii cung bdi lupn vdn dd hoan thinh, t5t nhi6n kh6ng thi5 tr6nh kh6i c6c sai s6t mong nhfln du-o c sU g6p y cira quf thdy cd vd c6c bpn.
Cin tho,ngiy 03 th6ng 06 ndm 2013
Sinh vi6n thuc hi€n
LAm Thf ChIm Pa
Trang 3Ldi ndi ddu
r-,OT NOI DAU
Lufn v5n t6t nghiQp li mQt m6n hgc do sinh vi6n tg hgc vd tim hi6u, tIrS nghiCn
hec tfp
Ngiy nay, thgc t(: c6 rdt nhi6u nhu ciu ddi h6i chirng ta phii giii quyi5t sao cho
hqp ly Thpt v4y, trong xE hQi tiiSn bQ hiQn nay khi cugc s6ng dd c6 ph6n sung tric vi
thoi gian ld tti€u mong udc cria m6i ngudi
O m5i khia cpnh kh6c nhau c6 c6c y6u cAu kh6c nhau, ri6ng AOi vOi ngdnh c6ng
c6 nhdn Edy thyc sg li mQt ddi h6i cdn thi6t, viQc tim kitim hic ndo cfing dugc uu ti€n
hdng dAu, vi chi khi c6 mQt phAn m6m hd trg tim kii5m t6t thi n6 sE c5 trlc dgng to l6n
trong c6ng viQc cfrng nhu trong hgc tap ctra b4n.
AOi vOi moi ngudi vi vfly phii bi6t cSch su dgng sao cho hqp ly vd hiru fch Cho n6n,
trong dcrn vi hgc trinh cira tdi, t6i da chgn cho minh ae tai *Xdy dpg c6ng cg tim kii5m tpp tin vdn bin theo nQi dung" ldm bii kh6a lufn ttit nghiQp.
Trang 4T6m tdt
C6ng cg tim ki6m ld mQt phAn m6m hiru ich vi cAn thitit cho m6y vi tinh, sE kh6khln khi phii tim thti c6ng, n6 ldm ban m6t qu5 nhi6u thoi gian, ldm chflm tr6 c6ng
vi6c cria b4n, thay vdo d6 khi da c6 c6ng cg tim ki6m trong tay, b4n sE nhin rdi hon, vi
ti6t kiQm dugc nhi€u thdi gian hcrn cho c6ng viQc khric.
Cdng cg tim ki6m tap tin v6n bAn theo nQi dung ld mQt phAn mOm kh6 quan trdng
vi cAn thi6t cho nhirng ai sir dung m6y vi tinh
Trong bdi b6o c6o kh6a lufn ndy t6i sE trinh biy qu6 trinh xdy dlmg mdt c6ng cutim ki6m t4p tin v6n bin theo nQi dung, d0 mqi ngudi hi€u rd hon sg cdn thi6t ctra n6.
ABSTRACT
manual find, it's will lost more time of you, and maybe your work must to delay Instead, if you use the search engine, you will saving much time for a next work
In my thesis report, i will be presents the processes to do a search text files bycontent I hope this became useful with everybody.
Trang 5Nh,2n xdt cua etdo vien huong diln
NHAN xET cul crAo vrfiN Hr.IoNG nAN
***
Cin tho, ngdy th6ng ndm 2013
Nguy6n Chi Cudng
t
Trang 6Nhdn xdt cila gido vi€n phan biQn
Cdn tho,ngey thring ndm 2013
Gi6o vi6n phin biQn
Le Thi Thu Lan
* * {< _
Trang 7Danh muc hinh
Hinh 2.1 B-tree bqc 2 c6 3 muc 13
Ilinh 2.2 : Tim ki€m gid tri X 14
Hinh 2.3 : Qud trinh tdch node khi thlm kh6a 22 16
Hinh 2.4: Qud trinh tdch node khi thOm kh6a 15 17
Ilinh 2.5: ThAmvdo cdc khda 35, 7,26 vd /8 17
Hinh 2.6: Qud trinh tdch node khi thAm kh6a 22 17
Hinh 2.7: Qud trinh tdch node khi th1m khoa i 1g Hinh 2.8: ThAmvdo cdc kh6a 42, 13,46, 27 vd 8 18
Hinh 2.9: Qud trinh tdch node khi th1m kh6a 32 lg Hinh 4.10: Thdm vdo 38, 24 vd 45 18
Hinh 2.1l: Qud trinh tdch node dQ quy khi thAm kh6a 25 19
Hinh j.2 Cdy nhi phdn 29
Hinh 4.l: Giao diQn chinh cila chuong trinh 36
Hinh 4.2: Giao diQn dgcfile 37
Hinh 4.3: Giao diQn qudt file 37
Hinh 4.4: Giao diQn qudt thu mttc 3g Hinh 4.5: Giao di€n cdc k€t qud tim drqc 3g Hinh 4.6: Giao di€n tim theo ftnfile 39
Hinh 4.7: Giao diQn tim theo n6i dung 39
Trang 8Danh muc bang
' Bdng 3 t Bi€u didn cdu trilc lm,t trfi chi m4c theo m6 hinh tra c*u lugn ly Errorl
Bookmark not defined
' Bdng 3.2 MO td cdch ddnh s6 cho tqp tin trong m6 hinh lqp chi mttc
nguqc 24
Bdng 3.3 Xdy d,rng tQp chi mqc nghlch dao co sdp x€p vd gom nh6m 25
Trang 9Muc lr,rc
1 rY oo cHo N op rat 1
2 TiNH HiNHNGHTEN ctru LrEN euAN I 2.1 C6ng cp tim kitim cira Windows 1
2.2 C6ng cp tim kitim cira Google 2
2.3 C6ng cp tim ki6m ctia Ping 3
3 MUC TIEU VA NHrEM VU CUA KHOA LUAN 4
3.1 Mpc ti6u 4
3.2 NhiQm vu 4
4 HTJONG GrAr QUYET s cHTIONG 2z CA sd r,f THUyfT 6
1 lY ruuvpr HQ th6ng tim ki6m tr0n m6y tinh c6 nhdn 8
1.1
t.2
Trang 10Mttc ltrc
2.2 Gi6i thigu B-Tree 12
CHITONG 3: NQI DUNG THTIC HIEN 23
l LAP CHi MUC 23
1.1 Phucrng ph6p chi mgc nghich d6o 23
1.2 X6y dpg c6u truc luu trfi cho tu diiSn trong phucrng ph6p lpp chi mpc nghich d6o 26
2 BQ TiM ruprr{ 30
: xAy D\NG TAp rrN xML 30
3.2 Tap tin DocID 30
3.3 Tpp tin chi muc 31
4 Lr.ru D0 32
4.1 Luu <16 tpo chi muc cho h1e 32
4.2 Luu <16 qu6t 6 dia/ thu mqc 33
s u0 simr 34
5.1 T6ng qu6t 34
5.2 Tim ki6m 35
CHI.IONG 4: TTT LUAN
TAI LryU THAM KHAo 41
Trang 11Chuong t: T6ngquan
1 Lv DO CHQN DE TAr
Tn khi linh vuc c6ng nghQ th6ng tin chi6m linh thi trudng, thi viQc tim kii5m
kh6ng cdn li thupt ngfi m6i me v6i ngudi ding nta Vi6c tra cr?u, tim ki6m c6c tu kh6atr6n web kh6ng m6t qu6 vii gi6y Ae ni6n thi ki5t qu6 D6i v6i nhirng ai li tin dd cira m?ng thi viQc tra tir kh6a tim ki6m thdng tin ld di6u kh6ng tne tnieu AOi vOi frg.
BCn canh d6, nhu cdu tim ki6m th6ng tin tr6n mrly tinh c6 nhin cflng kh6 quan
trgng, viQc truy xu6t dtln c6c nQi dung b€n trong thu muc ctng nhu vi6c tim kitim th6ngtin tr6n webpage vQy, d6u trA vC kiSt qu6 nhu y bgn mong dqi vC dich vp mi bpn str dung.
V0 nguy6n t6c viQc luu trfr th6ng tin vd tim ki6m th6ng tin thi don giin Gi6 su c6 mQt kho chrla tii liQu vd mQt ngudi mudn tim c6c tdi li6u li6n quan di5n y€u cAu cria
minh Ngudi d6 c6 th6 tlqc tet cir c6c tii li6u trong kho, git lui circ tii lieu li6n quan vi
b6 di c:ic tdi liQu kh6ng li6n quan R6 rdng giii phrlp ndy kh6ng thuc t6 vi t6n r6t ntriOu
thdi gian.
Chinh vi vfy, vi6c xdy dUng c6ng cp tim kii5m tpp tin vdn bin theo nQi dung ld
tr6n c6ng cp ndy chirng ta chi cdn nh6 mQt vdi tri kh6a c6 li6n quan d6n nQi dung di5
tim ra tfp tin d6.
2 TiNH nixH NGHTEN cLfu LrEN euAN
2.1 C6ng cg tim ki6m cfra Windows
2.1.1 Uu di6m
Trong c6c hQ di6u hinh cria Windows ddu c6 sEn c6ng cu tim kii5m d6y ld
the le cilc tQp tin, thu mgc, tim mQt mriy tinh trong hQ th6ng m?ng ho{c tim mQt ngudi
,^
ndo <1o trong sd dia chi Address book vd ngodi ra c6 th6 tim th6ng tin tr6n mAng Internet.
Trang 12Chuong 1: Tdngquan
- Tim chc tQp tin hinh 6nh, 6m thanh ho{c video D6nh d6u vdo c6c 6 iI6 chqnlo4i tap tin vd nhap t6n cira tQp tin mu6n tim, c6 th6 chi can nnap mQt phin cta t6n tQp tin, n6u dC tr6ng thi chuong trinh sE tim t6t ch circ tpp tin c6 trong hQ th6ng.
hoflc chgn Don't remember n6u kh6ng nhd vd nhap t6n cira t4p tin mudn tim
- Tim tdt ch chc tQ,p tin vi thu mqc, nh4p t6n cta tpp tin ho{c thu muc cAn tim,
tim trong Look in.
- Tim mQt m6y tfnh trong hQ thiSng mAng ho{c tim mQt nguoi ndo d6 trong sd
dia chi Address book N6u chgn tim People in your address book thi sE xu6t higncria sd cho ph6p nhQp c6c th6ng tin vA ngudi mu5n tim, nhflp mQt ho{c tdt ch cilc
thdng tin vd nh6n Find now dt5 tim
- Ngodi c5c thitit l6p tim ki6m co bin, chuong trinh cdn cho ph6p thii5t l6p c6c
ki€u tim ki6m mo r6ng hcrn nhu:
., Thdi gian tpp tin dugc tpolsua.
n Kich thu6c cta tQp tin
" Chgn c6c thi6t l4p mo rQng.
,,, Chon kieu dinh dpng tpp tin mu5n tim
,,, Tim trong c6c thu mpc hQ thting
Tim c6ct$p tin vi thu mpc An
Tim trong c5c thu mqc con.
" Tim chinh x6c t6n theo cht thuong holc chfr hoa.
o Tim trong c6c thi6t bi sao ltru.
Trang 13Tru6c h6t, Google Desktop khoi dQng nhanh hon c6ng cg tim ki6m ctra
Windows ngudi ding chi cin nh6y k6p vdo bi6u tucrng cria chuong trinh trongkhay hQ th6ng dO mo trang tim ki6m trong trinh duyQt, ti6p d6 96 tu kh6a tim ki6m
vd nh6n Enter vi c6c ktlt qui sC xu6t hiQn H9 cfing kh6ng cAn chgn gita tim ki6mtheo tOn file hay theo nQi dung ho{c <l!nh dpng vi Google Desktop tim t6t ci mQt lAn.
Thri hai, Google Desktop lo4i b6 nhu ciu su dUng c6c c6ng cu tim kiiSm
chpm vi kh6 tich hqp trong c6c phdn m6m web, e-mail vd vin phdng cria
Microsoft
Tinh nlng tim ki6m nhanh, tilt ch mQt 6n cria Google Desktop girip nguoi
dung khic phuc dugc sp hing tfng khi mu5n tim mQt tii liQu md kh6ng nh6 rd
ki6u file Hcrn nira, n6u chuong trinh khdng cho ra k6t qui mong mu6n, mQt cri nh6p chuQt vio du<vng link trOn cua s6 Google Desktop s€ gtii truy v6n d6n ngudi
Nguoc lpi, m6i lAn nguoi dirng dua ra m6t lQnh tim ki6m tr6n giao diQn th6ng thulng cria Google, k6t qui tim ki6m tir Google Desktop s€ dugc b6 sung tp
dQng <r phia tr6n mdn hinh
2.2.2 Nhtqc diAm
Sg ri6ng tu cira bpn sE bf soi m6i (khi b4n sri dpng chung mdy tinh vdi ngudi
kh6c Google Desktop hi6n thi tht cir dii liQu cria mgi ngudi, tru phi duqc thi6t lap preferences di5 lo4i bo mQt si5 thu muc hoic dinh dang Vi vfy, ntiu ddy li trubng
2.3 C6ng cg tim ki5m cria Ping
2.3.1 Uu di6m
Tim ki6m Ti6ng ViQt cira Ping nhin chung kh6 t5t Ping c6 vC uu tiOn nQi dung hcrn vd c5c k! thupt thdn thiQn v6i m5y tim ki6m
Trang 14Chuong t: T6ngquan
V6i m6i ki5t qui tim ki6m, b4n c6 mQt si5 th6ng tin mi6u ti chi tii5t hcrn vd trang,thudng li c6c mi6u ti snippet md Google hay su dgng vd d{t bigt la sitelink C6c
sitelink niy minh nhan thAy Bing to raxhc dfnh don gi6n hon dga vio menu vd s5 l6n
xuAt hign hon li ngir nghia r,hung lai hi6u qui kh6ng k6m Google, thQm chi ld cdn
Chua c6 phi€n bdn ti6ng vi6t, ld c6ng cu cdn non k6m.ViQc t6ch loc tiiing ViQt
trong c6c dlnh d4ng phirc tpp hon nhu Flash, PDF chua dugc ttit
Ciic c6ng cg h5 trg Webmaster cria Microsoft cdn b6o thir vdi cdc dinh dpng vd
.A
3 MUC TrEU VA NHrEM VV CUA KHOA LUAN
3.1Mgc ti6u
NQi dung kh6a lupn ld mdn hoc khri m6i chua dugc hgc trong chuong trinh, cAn
du-o c nghiCn cr?u vd hgc h6i Girip chring ta rdn luyQn k! nlng ph6n tich, xu ly v6n dA
vi khi ndng tu duy itQc lfp, cirng v6i sy trg girip cria gi6ng vi€n hudng din tin ring sC
Tim hi€u vC hQ th6ng tim ki6m th6ng tin
,, HQ thdng tim kii5m tfp tin tr6n m6y tfnh c6 nhdn.
,,, BQ phan lAp chi mpc C6c phuong ph6p ldp chi mqc.
, Phuong phrip luu trf b0 chi mgc.
Trang 15Chuong 1: Tdngquan
" LQp trinh tr6n Windows Form: Dgc nQi dung tpp tin vin bin (Word,
HTML, HTM, TXT, RTF )
" Luu trfr chi mgc theo c6u truc B-Tree
4 HTIoNG GrAr euyEr
Dti x6y dtmg c6ng cp tim kii5m t4p tin vdn bin ta c6 nhirng huong giii quytit nhu
sau:
o Ng6n ngt lfp trinh C#.
o Lpp chi muc bing phuong phrip chi muc nghich d6o
Trang 16Dinh nghia hQ th5ng tim ki6m th6ng tin cua mQt s6 t6c giit:
Salton (1989): "H€ th6ng tim ki6m th6ng tin xtr ly cictpp tin luu vd nhirng y6u cAu
ve th6ng tin, x5c dinh vd tim tir c6c tpp tin nhirng th6ng tin phi hqp v6i nhfrng y6u c6u v0 th6ng tin ViQc truy tim nhirng th6ng tin d{c thi php thuqc vi sp tuong thc gifia cicth6ng tin dugc luu tr& vi c6c y€u cAu, dugc dSnh gi6 bing c6ch so s6nh c6c gi5 trf cria c5c thuQc tinh dugc luu trft vd c6c y6u ciu vd th6ng tin"
Kowalski (1997): "HQ th5ng truy tim th6ng tin ld mQt hq thiing c6 khi ndng luutrfr, truy tim vd duy tri th6ng tin Th6ng tin trong nhirng trucrng hgp ndy c6 th6 bao g6m vrn bin, hinh 6nh, 6m thanh video vi nhfrng dtii tugrrg da phucrng ti6n kh6c."
"MOt hQ th6ng tim ki6m th6ng tin ld mQt phAn m6m girip ngudi su dpng tim ki6m
th6ng tin (information) h9 cAn Hay mQt hQ thdng th6ng tin girip ngudi sir dung tim
ki6m nhirng tii liQu (document) chr?a nhtng th6ng tin (information) hg c0n vi ngudi su dung sE rut trfch th6ng tin cAn thi6t fi nhirng tdi liQu d6."
"Truy tim th6ng tin li tim ki6m th6ng tin (thucmg li c6c tdi li€u) o mQt d4ng kh6ng
c6 c6u truc (th6ng thudng ld v5n bin), th6a m6n nhu c6u th6ng tin tir trong nhirng
ngu6n thdng tin lon (duo c luu trft tr6n c5c m6y tinh)." [irbookprint]
ti6ng n6i, tim ra rrhirng tQp c6 nQi dung li6n quan tltin mQt v6n <t€ (c6u h6i) ta c6n bitit
(hay tri loi) Di6n hinh cria c6ng nghQ niy li Google, mQt he tim ki6m th6ng tin tr6n
Web, md hAu nhu chring ta d6u dung thuong xuy6n.
Hi6u don giin hQ th5ng tim ki6m thdng tin li mQt hQ th6ng h5 trg cho ngutrisfr dgng tim ki5m th6ng tin mQt cich nhanh ch6ng vi d6 ding.
1.1.2 Mrlc ti6u vi chrfrc ning
Trang 17Chuong 2: Co so l!,thuy€t
MUc ti6u cira mQt h€ th6ng tim ki6m thdng tin ld tim ki6m vi tri vA c6c tdi liQu tri
ldi cho nhu cAu thdng tin cira ngudi ding
MOt he th6ng tim ki6m th6ng tin c6 hai chric ning chinh d6 le lpp chi mpc vi tra
cr?u hay tim ki6m:
o Lap chi mpc li giai do4n phin tich tni liQu dC rut trich c6c <Ion vi th6ng
tin tir tii liQu vd biOu di6n lpi tii li6u boi c6c dcrn vi th6ng tin d6 Ecrn vith6ng tin
c6 thti le "tt", hopc phric tpp hon ld "cpm tli", ookh6i niQm",
o Tra criu (hay tim ki6m) ld giai dopn tim ki6m trong co sd dfr liQu Oieu
di6n cho cfuc titi lieu boi c5c don v! th6ng tin trong chric n6ng lap chi muc) nhirng
tdi liQu pht hqp v6i nQi dung c0u truy v6n.
1.1.3 Ki6n trric
Theo I thuy6t thi nQi dung cdu truy v6n c6 th6 so srinh tryc ti6p vdi nQi dung cta
Nhtmg tr6n thyc tis thi kh6ng thti thpc hiQn nhu vpy dugc vi c5c ciu truy v6n vir cric
v6n vi tni liQu c6 th6 tu duy ra dugc m5i quan hQ giffa chring.
Vi vpy trong ki6n truc cria mQt hQ th6ng tim kii5m th6ng tin c6 bQ phpn trung gian
d0 thfc hiQn c6ng vi6c chuytin tlOi c6c th6ng tin cira tii liQu thinh thdng tin du-o c biiiudi6n trong hQ th6ng, d6 le b9 phpn lpp chi mgc.
c6u hoi (truy v6n) cria ngudi ding thdnh nhirng th6ng tin gdn gi6ng nhu trong hQ th6ng
di luu trfr, sau d6 so sSnh th6ng tin ndy vdi th6ng tin c6 trong hQ th6ng de tra ve t6t
que.
a) B0 phqn lQp chi muc
NhiQm vU chfnh cria bQ ph4n niy li: dgc nQi dung tu mQt kho tdi liQu l6n, t6 etin
phdn tich, thting k6, bii5u diSn lai nQi dung ctia ttng tdi liQu trong kho vd luu nhirngth6ng tin niy vdo co so dt liQu theo mQt c6u trirc nio d6 theo quy dinh cira hQ th6ng.
Co so dfr li0u ndy ggi ld tap tin chi muc cta hQ thdng.
Trang 18Chuong 2: Co so l!,thuyAt
C6 nhi6u c6ch dO bi6u di6n lai nQi dung cria tdi li6u trong tpp tin chi muc nhu: nQi dung g5c, rfit trich ngt nghia, rut trich dflc trung cria tdi liQu,
tpp tin d6 dC l6p chi muc; tii liQu ld hinh enh hoic 6m thanh thi chirng ta l6y ra d{ctnmg cira tdi liQu g6c Ae tpp chi mgc.
TQp tin chi muc sC dugc hQ th6ng luu trt, quin li, bd sung, cQp nhflt, vi qu6
trinh tim ki6m sE diSn ra tr6n nQi dung cria t4p tin ndy mi khdng cdn t5c dQng d6n c6c
tii lieu g5c trong kho.
b) B0 phdn tim ki6m
Th6ng tin ngudi dung cAn tim sE dugc nh$p thdng qua mQt giao diQn (c6 the legiao diQn web), thong tin niy co th6 0 d4ng mQt c6u hoi gAn gi6ng v6i ng6n ngfr t.u nhi€n hay mQt c6u trfrc nio d6 do hQ th6ng qui udc C6c cdu hoi niy s€ dugc hQ th6ng
phdn tich vi bi6u di6n lpi o mQt hinh thric m6i gi6ng nhu phAn lap chi mpc, khi <16 nQi dung dugc bi6u di6n l4i niy "d6ng d?ng" v6 mpt c6u truc v6i tpp tin chi mgc vi chfng
ta cfing dC dang so s6nh tryc tiifp gifra hai sg bi6u di6n ndy.
l.2lJlg thting tim ki6m trGn m6y tinh ci nhin
1.2.1 Girfi thiQu
HQ th6ng tim ki6m tdi liQu tr6n miiy tinh c6 nhdn ld tim kii5m th6ng tin ngudi
ding tr6n c6c lopi t4p tin vdn bin ptrO Uitin nhu Word, HTML, HTM, TXT, RTF vd
hi6n thi nhfing tpp tin c6 1i6n quan d6n tir khori ngudi ding nh4p vio
1.2.2 C6c vin dd
HiQu qui cria qu6 trinh lpp chi muc cho c6c tdi liOu luu trfi tr6n dia phu thuQc r5t
nhiAu vio ytiu t6 b9 doc cho ttng lo4i tpp tin kh6c nhau Do c6c lo4i tdi li6u kh6c nhauthi c6u truc kh6c nhau vd v6n d0 doc chinh x5c nQi dung vd c6u truc cria chting ld tucrng
d6i kh6 kh6n.
Tucmg ty nhu cric hQ th6ng tim ki6m tr6n web, hQ th6ng tim ki6m tr6n m6y tinhc6 nhdn cflng vfly, qu6 trinh lpp chi mUc diSn ra nhanh hay chgm tuj,thuQc vdo ki6nthric vd kinh nghiQm cta nguoi xiy dpg hQ th6ng vC nhi6u y6u tri nhu: c5u tnic chimqc (c6u tnic ciy, bing b6m, ), c6u trric file AC tuu trir chi mqc (file dpng text, nhi
ph6n )
Trang 19Chuong 2: Co sd l!,thuy€t
1.3 MOt s6 v6n rld trong tim ki6m thdng tin
Ke fir nhirng nim 40, c6c v6n dO trong viQc luu trt th6ng tin vd tim kii5m th6ngtin itd thu hirt sp chir y rdt l6n V6i mQt lugng th6ng tin kh6ng 16 thi viQc tim ki6mchinh x5c vir nhanh ch6ng cing tro nOn kh6 khdn hon V6i sU ra ddi cria m5y tinh, rit
chinh x6c Tuy nhi6n v5n dC tim kii5m sao cho hiQu qui vdn chua dugc gi6i quy6t.
Vd nguy6n t6c viQc luu trft vi tim ki6m th6ng tin thi dcrn gi6n Gii su c6 mQt kho
chria c6c tdi liQu vd mQt ngudi mu6n tim cric tdi liEu li6n quan d6n y€u c6u cta minh
Ngudi doc c6 th6 dgc tdt chc6c tdi liQu trong kho, gifi lpi c6c tii liQu li6n quan vd bo di
2 THI/C HANH
2.lLfp chi mgc
HQ th6ng lpp chi muc hay cdn ggi ln hQ th6ng phin tich vh xtr li dt liQu, thgc hiQn
tu quan trgng) tu kho tii lieu thu th4p dugc vd t6 chr?c thinh ccy sd dfi liQu ri6ng <tO c6
c6c tir kho6, chi rd c6c tu khoil niro xu6t hiQn trong t4p tin nio, vi tri nio
2.1.1 M6 hinh tra cru luqn ly
C6ng viQc chri yiSu khi xdy dpg mQt hQ thting tim ki6m li viQc lap chi mpc cho
kho tii li6u ctia chring ta, sau qu6 trinh lpp chi mgc thi nQi dung cria chi mpc sC biiSu diSn nQi dung cria kho tdi li6u o mQt dpng c6u truc ndo d6 do hQ th6ng quy <I!nh.
Trang 20Chrong 2: Co sd $ thuyiit C6u truc chi muc co bin vi dcrn gian nh6t li c6u truc chi mpc theo m6 hinh tra
di6n cho tdt chc6c tu (word) xu6t hiQn trong tii liQu.
DC minh hoa cho qu6 trinh xdy dpg chi muc theo c6ch tr€n, ta xem vi du sau d6 hi6u 16 v6n d6: tim kii5m t6n c6c nhdn vft trong c6cthcphAm cria nhd v[n Shakespeare.
xdy dpg bing th6ng kd g6m t6n nhdn vflt vd t€n c6c vo kfch Bing thting kC c6 th6
dugc m6 ti nhu sau:
'l'eic phim
r\ntorr,v []rutus ('easar Calpumia ('lcopatra Mercy' w0rser
Anton-\, anrl Clleopatra
00
0r
{lt 00 {i0 II l1
Julius 'l'h* I lamlet Othello N4acbcrh
I 0
0
0
0 t)
I
0 0 I
I
0
Bdng 3 t BiAu di6n cdu ffilc ltru tric chi mqtc theo m6 hinh tra ctat luQn t!'
Circ gi6 tri trong ma trQn (ki6u boolean) dnng dC bi6u di6n cho su c6 m{t cira nh6n vpt trong vo kich Gi6 tri ndy li 1 (true) niSu nhdn vft c6 trong vo kfch, ngugc lpi
gi6 tri ndy li 0 (false) ni5u nhin v4t kh6ng c6 trong vo kich
Chfng ta c6 c6c vector bits m6i tu bi6u di6n trong kho tii liQu Vi du tu
"Antony" le 110001 (1 ddng thuQc ma trfn tr6n).
Trong m6 hinh Boolean chring ta c6 th6 d{t ra U6t tcy truy v6n nio dudi dpng
m6t bii5u thfc logic tr6n tir ngt, trong d6 circ tri ngfr ndy k6t hqrp vdi nhau boi crlc phdp
Trang 21Chuong 2: Co so lj,thuyAt
Gi6 su y6u cAu tim ki6m ld c6c vo kich c6 chria 'oBrutus" vi "Caesar" nhtmg
"kh6ng c6 Calpurnia" Khi ti6 c6u truy v6n c6 th6 bi€u ditin nhu: Brutus AND Caesar
AND (NOT Calpurnia)
Gi6 tri 1 the hiQn vo kich o vi tri tuong img li ktit qui cin tri ve Oi6n gi6i cho dey bit tr€n li vo kfch <r vitri thri I "Antony and Cleopatra" vir vf tri thri 4 "Hamlet" li
ki5t qui tri vii
o Ktit qui cira m6 hinh ndy le 1 tei li6u se dugc x6c <tinh li dirng hay kh6ng dring Otii vOi 1 c6u truy v6n tim ki6m (tim chfnh x6c) Hon nfra c6c tdi liQu tr6 vC chi c6 th.5
dugc s8p xtip dga vdo id cira c6c tii liQu d6.
o C6u truy v6n ph6i o dpng cria mQt biiSu thric logic vd nhu th6 se r6t kh6 khln chongudi sri dgng th6ng thudrng.
o M6 hinh tra cr?u lupn l;f sri d\rng m6t ma trpn nhi phdn tlO bi6u diSn tpp chi mUc c6c tii liQu, viQc x6y dlrng ma trdn ndy ld rilt "ldngphi" Gi6 su chring ta c6 1 triQu tii
chring ta phii x6y dpg mQt ma tr4n g6m lM x 0,5M : 500 tj, phAn tir, trong d6 chi c6
1 tf phin tti mang gi6 tr! 1.
2.1.2 Phrong phdp chi mqc nghich ddo
V6i nhirng h4n ch6 d5 n6u, <tac biet li <r qu5 trinh lap chi muc tii liQu, ta nhpn
th6y m6 hinh tra ct?u lufln ly kh6ng th6 6p dqng d€ xdy d1mg c6c hQ th6ng tim ki6m lcyn.
phdp niy sO du<yc nghiCn cftu & phAn nQi dung thuc hiqn).
Trang 22Chuong 2: Cct sd ty thuyiit
Trong thrc t6 c6 nhi€u chi muc kich thudc cria n6 ld r6t lon (thgc t6 1 tabletrong thgc t6 s5 record cira n6 khoang hon I tripu ddng ld diAu binh thucmg) n6n cin phii du-o c luu tr6n cric thi6t bi phAn ctmg (thi6t bi nh0p xuSt - HDD), ngdy nay khi md dung luong cria c6c thilit bi ltru trii ting l6n r6t nhanh vd gi6 thenh 16 n€n cAn 1 phucrng
ph6p truy xdt nhanh tr6n 1 lugng dir liQu lon nhu vay.
B-Tree li 1 c6u truc dfi liQu d4ng cdy 'da phdn c6n blng'c6u truc 1 node cta n6 thuongdugc chia lim nhi6u phAn, cdy dugc t6 chric voi I qui tic nh6t <I!nh dugc t4o ra tu diu(thuong thi gi6 tri cria c6c node tdng din tu trrii qua ph6i, node con b6n tnli sE chria gi6
tri nh6 hon node con b6n phai )
B-Tree li 1 sU t6ng qu6t h6a cria c6y nhi phdn tim kiiSm @ST) sg kh6c biQt
chinh gita chting li s6 lugng node con cria B-Tree n6 sC nhi6u hcrn chr? kh6ng bi gioihpn boi 2 phdntu nhu cria cdy nh! phin tim ki6m Muc ti€u o ddy li lim sao d6 gi6mttii da sti tAn truy xu6t cric thitit bi ngopi vi o ddy ld fia, tpi b6t cri thoi di6m nio chfng
ta sE thir ilinh vi nhfing bin ghi, chring ta mu5n dQ cao cira ciy da phin ndy phii le nh6
*r6t Ae giim chi phi tim kii5m, de dat duoc muc ti6u niy thi sd lugng nh6nh trong c6y
Ni5u B-Tree c6 bfc ld m thi I node ctra n6 sE c6 tOi ea m nh6nh con vd c6 m-l kh6a <16
tim ki6m
o C6c reord chi ilugc luu 0 nrit 16 cria cdy
" T6t ci cfuc nirt (ngopi tru ntit g6c) c6 tOi tnieu DL rnlz vi t6i da ld m ciy(nh5nh)
con.
o Nrit g6c vd 16 ph:ii it nh6t c6 2 con.
o B6n trong 1 node luu gi6 tri tim ki6m, vdi chi dugc sri dung nhu nhtng "ngudichi dudng" cho qu6 trinh tim ki6m, nhtng kh6a niy cin phii dugc s6p x6p theo 1 thf
tg thting nh6t tu tru6c (thucmg thi t[ng dAn tri tr6i qua phai)
cria B-Tree bpc m c6 th6 luu nhi6u hon ho{c it hon m records Th6ng thudmg kich
Trang 23Chuong 2: Co sd tf thuydt
thu6c cria 1 node ld bQi sO cria I ttri5i af fiQu trong O Aa ft6n windows m{c <tinh ld
5l2byte)
o Cdc node 16 cria cdy thudng duqc li6n kt{t voi nhau de e" thenh I danh s6ch li€n
tiSt, aieu nny nhim mgc ti6u la clti khOng phni duyQt tro lgi node g6c trong qu5 trinh
drryQt tim kitim
2.2.1 DinhnglrTo
MQt B-hee bAc N co cdc d{c finh sau:
(+) M6i node c6 tOi <Ia N-l
Vf dp:
Ilinh 2.1 B-tree bdc 2 cd 3 m*cKhai brio:
#define ORDER 5
{
int NumTree; ll sd cdy con crla node hiQn hdnh
int Key[ORDER-ll; ll mdng lmt trii cdc khod cfia node
tagNODE* Branch[ORDER]; // cdc con trd chi d€n cdc node con
) NODE, *NODEPTR, +BTREE; ll contrd node
NODEPTR Root;// contrd node gdc
Trang 24Chuong 2: Co sd ly thuyAt
2.2.2 Tim kiilm ffong B-tree
Ifinh 2.2 : Tim kiiim gid ti X
X6t node trong hirdr.2.2, kh6a c6n tim li X Gia sri nQi dung cria X nim trong bQ
nh6- Voi m dri lon ta sri dgng phuong ph6p tim ki6m nh! phen, ntiu m nh6 ta sri dung
phuong phrip tim kitim tuen tg
(+) Km < X
(+) X 1Kr.
N6u X kh6ng th6y sE co ba trudng hqp sau:
(+) Ki < X < K;*1 Tiiip tqc tim kiiSm trin c6y con Ci
Tiiip tgc tim ki6m tr6n Cm
Ti6p tqc tim ki6m tr€n C6 Qu6 trinh ndy ti6p tuc cho <Iiin khi node dtng duo c tim thAy Ntiu di di diSn node
15 md v6n kh6ng tim thAy khoq viQc tim ki6m h th6t bai.
a) Hdm NODESEARCH
Tra v€ v! tri nh6 nhSt cria kh6a trong nrit p bit dAu lon hon hay bing k Trudng hqp
k lon hon dt cir cickh6a trong nrit p thi tra vc vi tri p-> NumTrees-l
int NodeSearch (NODEPTR p, int k)
Him NodeSearch tlugc dirng aC tim kh6a k c6 trong nrit p hay kh6ng N6u kh6a k
kh6ng co hong nut P thi ph6p toiin ndy tri vC vi tri girip chfng ta chgn n6t con phn hqp
cria p a6 ti6p tUc tim kh6a k trong ntt con niy
b) Hdn SEARCH
Tim kh6a k tran B-Tree Con tn6 p xuSt ph6t t& gtic vd tti xuting c6c nhrffi c6y con ph[r hqp iI6 tim kh6a k co trong mQt n6t p hay kh6ng.
N6u c6 kh6a k t4i nrit p tr6n cdy:
+ Bi6n found tra vA g16 H TRUE
Trang 25Chuong 2: Cct so lj,thuydt
+ Him Search( ) tr6 vC con tr6 chi nft p c6 chria kh6a k
+ Bi6n position tri vC vi tri cira kh6a k c6 trong nirt p ndy N€u kh6ng cd kh6a k trOn cdy: liuc niy p : NULL vi q (ntt cha cira p) chi ntit 16 c6 th6th6m kh6a k vio nrit niry dugc.
' + Birin found tri vC giltriFALSE
+ Hdm Search( ) tri vC con tr6 q li nrit 16 c6 th6m nft k viro
+ Bitin position trd v€ vi tri c6 th6 chdn kh6a k vio nrit ki q ndy
NODEPTR Search(int b int *pPosition, int *pFound)
*pPosition : i; ilL'i rt'i ti*,t thriv l;kd* k
return p; ii Node t:ti chwt *lxi* K)
q: p;
p: p ->Branch[i];
)
ri' ({1ung tnet i}tiir' lirr: na3 p 'N{ll.l." r,a ul !a nrttle lii *ri rltc tl:rnr khriii k
r,ii* node ntirr'" pitosizir:n l;\ r,i tri u* tl:c *hin khi>a k *,,
*pFound : FALSE;
*PPosition: i;
return g, I lit r * notlc l*
)2.2.3 DuyQt cdy B-tree
DuyQt c6c kh6a cira B-Tree theo tht t.u tu nh6 <ltin lcyn-bing phu<rng ph6p dQ qui
t
if (pRoot:: NULL)retum;