AN GIANG UNIVERSITY
FACULTY OF ENGINEERING - TECHNOLOGY - ENVIRONMENT

DUONG THANH TRUC - DTH082062

UNDERGRADUATE THESIS, BACHELOR OF INFORMATICS

A STUDY OF VIETNAMESE TEXT CLASSIFICATION TECHNIQUES

Supervisor: Dr. Nguyen Van Hoa
Student: Duong Thanh Truc

An Giang, 05/2012
ACKNOWLEDGEMENTS
First of all, I would like to express my deepest gratitude to my teacher, Dr. Nguyen Van Hoa, who guided me wholeheartedly throughout the work on this thesis.
I would also like to thank MSc. Ho Nha Phong, MSc. Nguyen Thi My Truyen and the other lecturers who taught me with dedication over the past four years. The knowledge I gained at university will help me move forward with confidence in the future.
Finally, I would like to sincerely thank all my friends, and especially my parents and my younger sister, who always encouraged me and helped me overcome the difficulties in my life.
The text classification process is divided into three main stages:
- Data preparation: collecting the sample dataset, segmenting words, computing word weights, removing stop words that carry no meaning for classification, and selecting features.
- Training: building the classification models. The way a classifier is built depends on the chosen method.
- Classification and evaluation: testing the trained models on new documents and computing the classification accuracy, from which ways to improve the models can be found.
I experimented on five topics: education, law, health, sports and computing. For each topic I collected 200 sample documents as training and test data (1000 documents in total). The hold-out method (randomly taking 2/3 of the dataset for training and the remaining 1/3 for testing, repeating this three times and averaging the results) was then used to evaluate the classifiers built with the two methods, SVM and the statistical method, giving the following results:
TABLE OF CONTENTS
CHAPTER 1: OVERVIEW
1.1. Problem statement
1.2. History of the problem
1.3. Scope of the thesis
1.4. Research method / approach
CHAPTER 2: THEORETICAL BACKGROUND
2.1. Introduction to the Vietnamese text classification problem
2.2. Text classification model
2.1.1. Data preparation stage
2.1.2. Training stage
2.1.3. Classification and evaluation stage
2.3. Main tasks in the classification process
2.3.1. Text normalization
2.3.2. Word segmentation
2.3.3. Text representation
2.3.4. Feature selection
2.4. Text classification methods
2.4.1. The k-nearest neighbours method (kNN)
2.4.2. The Naive Bayes method
2.4.3. The decision tree method
2.4.4. The support vector machine method (SVM)
CHAPTER 3: RESEARCH CONTENT AND RESULTS
3.1. Building the classifier
3.1.1. Model of the classification steps
3.1.2. Building the dataset
3.1.3. Text preprocessing
3.1.4. Feature selection
3.1.5. Vector space modelling
3.1.6. Building the classifier
3.1.7. Testing and evaluation
3.2. Building the text classification system
3.2.1. Requirements
3.2.2. Analysis
3.2.3. Design
3.3.1. Evaluating the algorithms
3.3.2. Comparing the algorithms
CONCLUSION AND FUTURE DEVELOPMENT
REFERENCES
APPENDIX A: USE CASE SPECIFICATIONS
APPENDIX B: STOP WORD LIST
LIST OF FIGURES
Figure 1: Labelling text documents
Figure 2: Data preparation stage model
Figure 3: Training stage model
Figure 4: Classification stage model
Figure 5: Text representation
Figure 6: A hyperplane separating the positive samples from the negative samples
Figure 7: Model of the text classification steps
Figure 8: Overall use case
Figure 9: Use case diagram for the classification function
Figure 10: Use case diagram for the data management function
Figure 11: Use case diagram for the text feature management function
Figure 12: Use case diagram for the topic feature management function
Figure 13: Use case diagram for the stop word management function
Figure 14: Use case diagram for the representation word management function
Figure 15: Use case diagram for the training dataset management function
Figure 16: Use case diagram for the classifier management function
Figure 17: Use case diagram for the system login function group
Figure 18: System architecture
Figure 19: System function diagram
Figure 20: System interface diagram
Figure 21: Main program interface
Figure 22: Text classification interface
Figure 23: Activity diagram of the text classification function
Figure 24: Text classification interface
Figure 25: Activity diagram of the text classification function
Figure 26: Add-new-topic interface
Figure 27: Activity diagram of the add-topic function
Figure 28: Topic management interface
Figure 29: Activity diagram of the topic management function
Figure 30: Add-new-text interface
Figure 31: Activity diagram of the add-text interface
Figure 32: Text management interface
Figure 33: Activity diagram of the text management interface
Figure 34: Text feature management interface
Figure 35: Activity diagram of the text feature management interface
Figure 36: Topic feature management interface
Figure 37: Activity diagram of the topic feature management interface
Figure 38: Stop word management interface
Figure 39: Activity diagram of the stop word management interface
Figure 40: Text representation word management interface
Figure 41: Activity diagram of the representation word management interface
Figure 42: Training dataset management interface
Figure 43: Activity diagram of the training dataset management interface
Figure 44: Dataset export interface
Figure 45: Activity diagram of the training dataset export interface
Figure 46: Automatic classifier building interface
Figure 47: Activity diagram of the automatic classifier building interface
Figure 48: Classifier management interface
Figure 49: Activity diagram of the automatic classifier building interface
Figure 50: Relational diagram (database)
LIST OF TABLES
Table 1: Tone mark normalization
Table 2: List of actors
Table 3: System configuration
Table 4: Data usage of the text classification interface
Table 5: Data usage of the folder classification interface
Table 6: Data usage of the add-new-topic interface
Table 7: Data usage of the topic management interface
Table 8: Data usage of the add-new-text interface
Table 9: Data usage of the text management interface
Table 10: Data usage of the text feature interface
Table 11: Data usage of the topic feature interface
Table 12: Data usage of the stop word management interface
Table 13: Data usage of the text representation word interface
Table 14: Data usage of the training dataset management interface
Table 15: Data usage of the training dataset export interface
Table 16: Data usage of the automatic classifier building interface
Table 17: Data usage of the classifier management interface
Table 18: Structure of the topic table
Table 19: Structure of the text table
Table 20: Structure of the topic feature table
Table 21: Structure of the text feature table
Table 22: Structure of the representation word table
Table 23: Structure of the stop word table
Table 24: Structure of the classifier table
Table 25: Structure of the account table
Table 26: Confusion matrix of classification results, SVM algorithm, run 1
Table 27: Confusion matrix of classification results, SVM algorithm, run 2
Table 28: Confusion matrix of classification results, SVM algorithm, run 3
Table 29: Confusion matrix of classification results, statistical algorithm, run 1
Table 30: Confusion matrix of classification results, statistical algorithm, run 2
Table 31: Confusion matrix of classification results, statistical algorithm, run 3
Table 32: Use case: classify text
Table 33: Use case: classify folder
Table 34: Use case: add topic
Table 35: Use case: delete topic
Table 36: Use case: add text
Table 37: Use case: delete text
Table 38: Use case: segment words
Table 39: Use case: normalize text
Table 40: Use case: remove stop words from text
Table 41: Use case: find text features
Table 42: Use case: delete text features
Table 43: Use case: find topic features
Table 44: Use case: delete topic features
Table 45: Use case: add stop word
Table 46: Use case: delete stop word
Table 47: Use case: delete representation word
Table 48: Use case: view the list of training datasets
Table 49: Use case: create training dataset
Table 50: Use case: back up training dataset
Table 51: Use case: restore training dataset
Table 52: Use case: export training dataset
Table 53: Use case: build classifier automatically
Table 54: Use case: create classifier
Table 55: Use case: evaluate classifier
Table 56: Use case: test classifier
Table 57: Stop word list
LIST OF ABBREVIATIONS
Full term
Database
Classifier
Representation
Topic feature
Training dataset
Text feature
Topic
Text
Support Vector Machine
CHAPTER 1: OVERVIEW
1.1. Problem statement
Information technology has changed the whole world: a hard disk the size of a hand can hold as much data as a large room full of books. Today, many sources of information in text form have gradually been converted into data stored on computers or transmitted over networks. Because of advantages such as compact storage, long retention, and convenience of use and exchange, this data has grown into an enormous volume of digital libraries, email, the World Wide Web and data stored on personal computers. As the number of documents grows, so does the need to search them, and automatic text classification becomes an urgent requirement. Text classification lets us find information faster than searching each document in turn. Moreover, as the number of documents keeps growing quickly, searching documents one by one takes a great deal of time and effort and becomes a tedious, infeasible task. Automatic text classification is therefore a real necessity.
The text classification problem is very important in text data processing and is widely applied in many fields, such as search, information extraction and filtering, spam e-mail filtering, e-mail classification and automatic news classification. It is also a foundation and driving force for the development of other research fields.
Automatic classification is one of the classic problems in text data processing and plays an important role whenever a large amount of data has to be processed. Around the world there have been many studies in this direction with promising results. However, research and applications for Vietnamese text are still limited, largely because of the lexical and syntactic characteristics of Vietnamese. Many text classification methods have been used, such as Bayesian decision, decision trees, k-nearest neighbours and neural networks. These methods give acceptable results and are used in practice. In recent years, classification with the Support Vector Machine (SVM) has attracted attention and has been widely used in recognition and classification. Compared with other classification methods, the classification ability of SVM is equivalent or significantly better [5].
1.2. History of the problem
Text classification has been studied by many people in recent years. Many studies on English and other languages have achieved promising results. Work in this field includes the statistics-based methods of Yang & Xin (1999) [13] and the Support Vector Machine [8]. For Vietnamese there have also been many studies, such as Vietnamese text classification with the SVM classifier [5], the application of frequent itemsets and association rules to Vietnamese text classification with semantic considerations [2], and text classification using decision trees [6]. In general, these approaches all give acceptable results. However, some limitations remain because of the lexical and syntactic characteristics of Vietnamese text, which reduce classification effectiveness.
1.3. Scope of the thesis
In this thesis I address the following:
- A brief introduction to the text classification problem.
- Issues related to text classification, such as word segmentation and text representation.
- A presentation of the text classification algorithms that have been used.
- Issues specific to classifying Vietnamese text.
- Building a Vietnamese text classification program using the SVM algorithm and the statistical method.
Text classification determines which of the predefined topics a document belongs to, or reports that the topic cannot be determined. The number of topics can be extended as needed. In this thesis I build five topics: education, law, health, sports and computing. For each topic I collect 200 sample documents to use as training and test data.
1.4. Research method / approach
The main problems to be solved in this thesis are:
- Studying the theory and algorithms of text classification: surveying the classification algorithms that have been used and their effectiveness, and building a program that compares the effectiveness of the classification algorithms.
- The text classification process: surveying the classification processes that have been used (mainly in two references: Vietnamese text classification with the SVM classifier [5] and Vietnamese text classification using decision trees [6]), then selecting and adjusting a process to fit the actual situation.
- Issues related to classification:
o Word segmentation: segmenting words in Vietnamese text is very difficult because of the characteristics of the language. Many authors have studied this problem and achieved good results. In this thesis I use the word segmentation tool vnTokenizer 4.1.1 [1].
o Feature selection: a document contains many words that carry no meaning for classification, so these words must be removed when the document is represented. Feature selection picks out the words that are meaningful for classification. It is carried out in two ways, manually and automatically.
o Text representation: for a computer to understand the meaning of a document and to tell one document from another, the document must be represented in some form. Many representations are used effectively, such as vectors and syntax trees. In this thesis I use the vector representation because it is simple and meets the requirements (details are given in a later section).
- Building the SVM model: the SVM algorithm has long been studied by many experts with great success, and it has been implemented as libraries for different purposes. In this thesis I use the SVM.NET library by Matthew Johnson, version 1.63 [1].
- Building the SVM classifiers: a document may have features that belong to several different topics. I therefore build several classifiers, each of which separates two topics, which gives n*(n-1)/2 classifiers (where n is the number of topics). A document to be classified is run through all of these classifiers. Whenever a classifier assigns the document to a topic, the score of that topic is increased. Finally, the topic with the highest score is chosen, or the first topic among several topics
o Testing and evaluation: using the methods for evaluating classification effectiveness that are common in this field, mainly the hold-out method.
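The pairwise voting scheme described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the `pairwise_classifiers` mapping and the toy classifiers are hypothetical stand-ins for trained SVMs, and the tie-break (take the first topic, in the given order, among those with the top score) is an assumption.

```python
from collections import Counter
from itertools import combinations

def one_vs_one_vote(document, topics, pairwise_classifiers):
    """Classify `document` by majority vote over pairwise classifiers.

    pairwise_classifiers maps each (topic_a, topic_b) pair to a callable
    that returns the winning topic for the document (hypothetical interface).
    """
    scores = Counter({topic: 0 for topic in topics})
    for pair in combinations(topics, 2):
        winner = pairwise_classifiers[pair](document)
        scores[winner] += 1
    best = max(scores.values())
    # Assumed tie-break: the first topic, in the given order, with the top score.
    return next(topic for topic in topics if scores[topic] == best)

topics = ["education", "law", "health", "sports", "computing"]

# Toy stand-in for trained SVMs: prefer whichever topic name occurs in the text.
def make_toy_classifier(a, b):
    return lambda doc: b if b in doc and a not in doc else a

classifiers = {pair: make_toy_classifier(*pair) for pair in combinations(topics, 2)}
print(len(classifiers))  # 10, i.e. n*(n-1)/2 for n = 5
print(one_vs_one_vote("a report on sports results", topics, classifiers))  # sports
```

With five topics the scheme needs ten pairwise classifiers, and every document is scored by all of them before the vote is taken.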
CHAPTER 2: THEORETICAL BACKGROUND
2.1. Introduction to the Vietnamese text classification problem
Text classification is one of the difficult problems in text data processing. Solving it means assigning each document a label that places it in one of the predefined topics.
Figure 1: Labelling text documents
The text classification problem is divided into two main types:
- Single-label classification: a document is assigned exactly one label.
- Multi-label classification: a document may be assigned several labels.
The text classification problem is very important in text data processing and is widely applied in many fields, such as search, information extraction and filtering, spam e-mail filtering, e-mail classification and automatic news classification. It is also a foundation and driving force for the development of other research fields.
We are now receiving an enormous volume of data from every field. Exploiting this data and discovering knowledge in it is essential and attracts many researchers. Machine learning is currently one of the most successful approaches to data mining.
Machine learning is divided into three types: supervised learning, unsupervised learning and reinforcement learning. Supervised learning methods are currently widely used for text classification and have been very successful.
2.2. Text classification model
Text classification with supervised learning methods is divided into three main stages: data preparation, training, and classification with evaluation of the results.
2.1.1. Data preparation stage
This is the first stage of the text classification process. Its result is a vector space that serves as the basis for the learning algorithm. It is an important stage that strongly affects the effectiveness of the classifier, because if the knowledge is incomplete or inaccurate, no effective classifier can be trained.
This stage consists of the following tasks:
- Collecting sample data: this is quite important and takes considerable time and effort. Data can be selected from many different sources, but the collected documents must be correctly labelled and sufficiently similar to one another.
- Feature selection: selecting the words with high classification value and removing words or attributes that carry no classification meaning from the dataset, in order to improve classification performance and reduce training time.
Figure 2: Data preparation stage model
2.1.2. Training stage
Once the training dataset has been built, we use the chosen learning algorithms, such as SVM, decision trees, kNN or Naive Bayes, to train on it. The result of this stage is a set of classifiers.
Figure 3: Training stage model
2.1.3. Classification and evaluation stage
To classify a document, the steps of the data preparation stage must first be applied to it. The resulting vector space representation is then passed to the classifier.
Evaluating a classifier has two aspects: evaluation on the training dataset and evaluation on new data. Note that the measure chosen should suit the learning algorithm.
Figure 4: Classification stage model
2.3. Main tasks in the classification process
2.3.1. Text normalization
For the classification system to access documents, they must be formatted according to a common rule. Plain text is usually used as the format for both the training files and the new files to be classified.
The data for the classification system are collected from many different sources, so spelling and grammatical errors are hard to avoid. These errors strongly affect word segmentation and the construction of the classification system. To improve learning and classification, such errors need to be removed or corrected before documents are fed into the system.
2.3.2. Word segmentation
a. Role of word segmentation
Word segmentation plays a very important role in text classification: it lets the learning algorithm understand and analyse the text. Inaccurate segmentation can lead to a wrong reading of the text's meaning. Each natural language has its own characteristics, so segmentation differs between languages. For example, in English text each word is a single token delimited by spaces, but this is not the case in Vietnamese. A Vietnamese word may consist of several syllables and may have different meanings depending on the context of the sentence.
b. The Maximum Matching algorithm
This algorithm has two forms:
- Simple form: used to resolve single-word ambiguity (Yi-Ru Li, 1995). Given a character string C1, C2, C3, ..., Cn, scan from the first character of the string, checking whether _C1_ is a word, then whether _C1C2_ is a word, and so on until the longest word in the dictionary is found. Select that word, then repeat the process on the remaining string until all the words have been determined.
- Complex form: this variant of the Maximum Matching algorithm, proposed by Chen and Liu (1992), is much more complex than the simple form. They argue that the most plausible analysis for choosing a word is the one based on the three-word chunk with the greatest length. Starting from the first word of the string, if there is ambiguity (for example, _C1_ is a word but _C1C2_ is also a word), we find the words that follow each of those two words, and continue in the same way until all three-word chunks have been found. The three-word chunk with the greatest length is then chosen. Suppose the longest three-word chunk is as follows:
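The simple form of Maximum Matching can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the input is already split into syllables, the tiny `dictionary` and the `max_len` limit are hypothetical, and candidates are tried from longest to shortest, which yields the same longest match as growing the candidate one syllable at a time.

```python
def maximum_matching(syllables, dictionary, max_len=4):
    """Greedy forward maximum matching over a list of syllables."""
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + size])
            # A single syllable is accepted even if it is not in the dictionary.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

dictionary = {"phân loại", "văn bản", "tiếng Việt", "phân"}
print(maximum_matching("phân loại văn bản tiếng Việt".split(), dictionary))
# ['phân loại', 'văn bản', 'tiếng Việt']
```

Because the match is greedy, "phân loại" is preferred over the shorter dictionary entry "phân". The complex form exists to resolve the cases where this greedy choice is wrong.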
The MMSEG model is a word identification system for Mandarin Chinese text introduced by Chih-Hao Tsai (1996). The model extends the two variants of the Maximum Matching algorithm. What is new in this model is the use of three additional disambiguation rules. Two of these rules were introduced by Chen and Liu (1992) and the remaining one was proposed by Chih-Hao Tsai.
Chih-Hao Tsai tested the model on Mandarin and achieved 98% accuracy, which is relatively high compared with other methods. He used a lexicon of 124,499 multi-character entries (a character corresponds to a syllable in Vietnamese), with entry lengths from 2 to 8 characters, together with usage frequencies for 13,060 single characters, which are used in rule four for word disambiguation. The algorithm is described in detail as follows:
- Simple form: for character Cn in the string, match the substrings starting at Cn against the dictionary to find all possible matching words.
- Complex form: for character Cn in the string, find all possible three-word chunks starting at Cn, regardless of whether the first word is ambiguous. These three-word chunks are only formed when the first word is ambiguous. The following four disambiguation rules are then used to find the correct word.
o Rule 1: Maximum matching (Chen & Liu, 1992)
• Simple maximum matching: take the word with the greatest length.
• Complex maximum matching: take the first word of the chunk with the greatest length. If more than one chunk has the greatest length, apply the next rule.
o Rule 2: Largest average word length (Chen & Liu, 1992). At the end of a string, chunks with only one or two words are common. For example, the following chunks have the same length and the same variance of word lengths:
1. _C1_ _C2_ _C3_
2. _C1C2C3_
Rule 2 takes the first word of the chunk with the largest average word length. In the example above, the word _C1C2C3_ from the second chunk is taken. The assumption behind this rule is that multi-character words are encountered more often than single-character words.
This rule is only useful when one or more word positions in the chunk are empty. When the chunks are genuine three-word chunks the rule does not help, because three-word chunks with the same total length necessarily have the same average length. Another solution is therefore needed.
o Rule 3: Smallest variance of word lengths (Chen & Liu, 1992). Suppose there are the following two three-word chunks:
Rule 4 takes the first word of the chunk with the largest sum of log frequencies. Since two characters are very unlikely to have exactly the same frequency, no ambiguity should remain after this rule is applied.
2.3.3. Text representation
For a learning algorithm to understand and analyse documents, they must be represented according to some model. Which representation model is used depends on the classification algorithm. One of the simplest and most commonly used is the vector space model, in which each document is represented as a vector.
Figure 5: Text representation
As a result we obtain the vector vb = (1, 0, 1, 1, 0, ...)
a. Logical weights
Logical weighting is the simplest way to weight words. In this approach the value of a word is 1 if it appears in the document and 0 if it does not.
b. Term-frequency weights
Term frequency (TF) is the number of times a word appears in a document. This weighting scheme assumes that a word is important in a document if it appears many times in that document. In this scheme, w_i denotes the value of the i-th word and TF_i the number of times the i-th word occurs in the document.
2.3.4. Feature selection
a. Stop word removal
A document contains many words that are not really necessary and have no meaning for classification. These are called stop words. They include relation words, conjunctions, digits, punctuation and so on. Such words usually appear very often in a document and do not reflect its content for classification purposes. They therefore need to be removed to make documents more distinct from one another, which reduces dimensionality and at the same time increases accuracy and processing speed.
There are several ways to remove stop words. The classical one is to compile a list of the stop words to remove. This is simple but not general, because not every such word can be listed. It is easy to see that stop words are words that occur either too rarely or too often across the documents, so we can rely on document frequency and set thresholds to remove them.
b. Dimensionality reduction
After representation, documents become vectors whose number of dimensions equals the number of words used for the representation. For the classifier to work effectively, a very large number of representation words would be needed, which makes learning and classification much slower and impractical. In some cases, using too many features to represent documents even reduces the effectiveness of the classifier. Reducing the number of features used to represent documents is therefore essential.
There are two directions for dimensionality reduction, depending on whether the reduction is local or global:
- Local dimensionality reduction: for a class c_i, choose the attributes or words that are most informative for class c_i.
- Global dimensionality reduction: choose the attributes or words that are informative for classifying all the classes C = {c_1, c_2, c_3, ..., c_k}.
There are many dimensionality reduction methods for text classification, such as mutual information, the chi-square statistic and document frequency.
o Mutual information:
In text classification, this method uses the mutual information between each word and each document class to choose the best words. The mutual information between word t and class c is computed as follows:
$$MI(t, c) = \sum_{t \in \{0,1\}} \sum_{c \in \{0,1\}} P(t, c) \log \frac{P(t, c)}{P(t)\,P(c)}$$
where:
P(t, c) is the probability that word t and class c occur together
P(t) is the probability that word t occurs
P(c) is the probability that class c occurs
The global MI measure (computed over the whole training set) for word t is computed as follows:
$$MI_{avg}(t) = \sum_{i=1}^{k} P(c_i)\, MI(t, c_i)$$
o Chi-square statistic:
The chi-square statistic assesses the goodness of fit between observed data and expected data. Let χ² denote the fit between the observed values (O) and the theoretically expected values (E). The χ² statistic then has the form:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
In text classification, the χ² statistic measures the dependence between word t and class c. The larger the χ² value, the more strongly word t depends on class c.
A global χ² measure is computed over the whole training set.
o Document frequency:
The document frequency of a word is the number of documents that contain it. In this method, thresholds are set to remove words whose document frequency is lower or higher than the predefined thresholds. These are stop words or rarely used words, which introduce noise into classification. Removing them aims to improve classification accuracy.
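Two of the selection criteria above, document frequency thresholds and the χ² statistic, can be sketched as follows. This is an illustrative sketch only: documents are assumed to be sets of already segmented words, the function names and thresholds are hypothetical, and χ² is computed over the 2x2 table of word presence against class membership using the (O - E)²/E form.

```python
from collections import Counter

def document_frequency(documents):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for words in documents:
        df.update(set(words))
    return df

def select_by_df(documents, min_df, max_df):
    """Keep words whose document frequency lies within [min_df, max_df]."""
    df = document_frequency(documents)
    return {word for word, count in df.items() if min_df <= count <= max_df}

def chi_square(documents, labels, word, target):
    """Chi-square score: sum of (O - E)^2 / E over the 2x2 word/class table."""
    n = len(documents)
    observed = Counter((word in doc, label == target)
                       for doc, label in zip(documents, labels))
    score = 0.0
    for has_word in (True, False):
        for in_class in (True, False):
            row = observed[(has_word, True)] + observed[(has_word, False)]
            col = observed[(True, in_class)] + observed[(False, in_class)]
            expected = row * col / n
            if expected > 0:
                score += (observed[(has_word, in_class)] - expected) ** 2 / expected
    return score

docs = [{"bong", "da", "va"}, {"bong", "ro", "va"}, {"luat", "su", "va"}, {"toa", "an", "va"}]
labels = ["sports", "sports", "law", "law"]
print(sorted(select_by_df(docs, min_df=2, max_df=3)))  # ['bong']
print(chi_square(docs, labels, "bong", "sports"))      # 4.0
print(chi_square(docs, labels, "va", "sports"))        # 0.0
```

In this toy data the word "va" occurs in every document, so it is removed by the upper threshold and also scores 0 under χ², while "bong" occurs only in the sports documents and scores highest.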
The document frequency thresholds nevertheless need to be chosen with care, because the document frequency of a word also reflects its importance for classification.
2.4. Text classification methods
Many learning algorithms have been applied to the text classification problem. Each method has its own characteristics and has achieved a certain degree of success. The methods most widely used in this field include the following.
2.4.1. The k-nearest neighbours method (kNN)
kNN is a well-known traditional method in the statistical approach that has been studied for many years. It is regarded as one of the best methods and has been used since the early days of text classification research. It is also known as instance-based, lazy or memory-based learning. kNN can be applied to two kinds of learning problem: classification and prediction/regression. It has been applied successfully in most areas of information retrieval, recognition and data analysis.
- Algorithm: kNN classification has two stages:
o Learning stage: simply store the training data.
o Classification stage: to classify a new data item z, compute the distances from z to each training item x. Determine NB(z), the set of nearest neighbours of z under the distance function d. z is then assigned to the majority class among the classes of the training items in NB(z).
For kNN the distance function is very important. It is usually fixed in advance and does not change during learning and classification. Candidate functions include geometric distance functions, the Hamming distance and cosine similarity. Each kind of distance function suits a particular kind of problem. For text classification, cosine similarity is used:
$$\cos(x, z) = \frac{x \cdot z}{\|x\|\,\|z\|} = \frac{\sum_i x_i z_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i z_i^2}}$$
The method has the following drawbacks:
o An appropriate distance function must be determined.
o Searching for the k nearest items takes a lot of time, and the optimal k is hard to find.
o Classification is poor when the documents are noisy.
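The two kNN stages with cosine similarity can be sketched as below. This is a minimal sketch under stated assumptions: documents are already represented as equal-length term-frequency vectors, the toy `training` data are invented for illustration, and the neighbour search is a plain linear scan.

```python
import math
from collections import Counter

def cosine(x, z):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, z))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in z))
    return dot / norm if norm else 0.0

def knn_classify(z, training, k=3):
    """Assign z to the majority label among its k most similar training vectors."""
    neighbours = sorted(training, key=lambda item: cosine(z, item[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# "Learning" is just storing the labelled vectors.
training = [
    ([3, 0, 1, 0], "sports"),
    ([2, 1, 0, 0], "sports"),
    ([0, 0, 2, 3], "law"),
    ([0, 1, 1, 2], "law"),
]
print(knn_classify([1, 0, 0, 0], training, k=3))  # sports
```

The linear scan over all training vectors is what makes classification slow on large collections, which matches the second drawback listed above.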
2.4.2. The Naive Bayes method
The Naive Bayes algorithm relies mainly on Bayes' probability theorem, with the assumptions that the attributes (variables, dimensions) are independent of one another and equally important. Although this assumption never holds for real data, in practice Naive Bayes gives quite good results and has been successful in text classification and filtering.
$$\Pr(Y \mid X) = \frac{\Pr(X \mid Y)\,\Pr(Y)}{\Pr(X)}$$
where:
X, Y are arbitrary variables (discrete, numeric, structured, ...), with Y predicted from X
Pr(X): the probability that X occurs
Pr(Y): the probability that Y occurs
Pr(X|Y): the probability that X occurs given that Y occurs
Pr(Y|X): the probability that Y occurs given that X occurs
To apply this to the classification problem, the following data are needed:
o D: the training dataset, vectorized in the form x = (x_1, x_2, ..., x_n)
By conditional independence:
$$\Pr(x \mid C_i) = \prod_{k=1}^{n} \Pr(x_k \mid C_i)$$
The classification rule for a new document X_new = {x_1, x_2, ..., x_n} is then:
$$c = \arg\max_{C_i} \Pr(C_i) \prod_{k=1}^{n} \Pr(x_k \mid C_i)$$
where:
Pr(C_i): computed from the frequency of documents in the training set
Pr(x_k|C_i): computed from the attribute sets obtained during training.
- Advantages and disadvantages: Naive Bayes is very effective in some cases. If the training data are poor and the predictive parameters (such as the feature space) are of low quality, the results will be poor. However, it is regarded as a linear classification algorithm that is well suited to multi-topic text classification, with several advantages: it is simple to implement, fast, easy to update with new training data because it is highly independent of the training set, and able to combine several different training sets. An optimal threshold is usually also set to obtain good classification results. Its drawback is that the conditional independence assumption on the attributes reduces classification accuracy.
2.4.3 Decision tree method
The decision tree is one of the top 10 data mining algorithms [11]. Unlike other learning models such as neural networks or support vector machines, the decision tree model is simple and fast and still gives good results. In particular, the output of a decision tree is a set of simple rules that is easy to interpret. Decision tree algorithms can handle both discrete and continuous data. Decision trees are found in most applications, such as text classification, spam filtering, intrusion detection, and also regression problems.
Decision tree learning consists of two major steps: building the tree (top-down) and pruning it (bottom-up) to avoid overfitting. The tree is built as follows:
- Start at the root node; all training data are at the root node.
- If all data at a node belong to the same class, the node becomes a leaf, and the label of the leaf is the label of the elements in it (or the majority vote if the leaf contains elements of different classes).
- If the data at a node contain elements of very different classes (the node is not pure), the node becomes an internal node, and the data are partitioned recursively by choosing the attribute that gives the best possible partition.
Building the tree depends mainly on choosing the best attribute for partitioning the data. An attribute is considered good, and is used for partitioning, if it leads to the smallest possible tree. The choice relies on a heuristic: choose the attribute that produces the purest nodes. There are currently two representative decision tree learning algorithms: C4.5 by Quinlan [9] and CART by Breiman et al. [7]. To evaluate and select the partitioning attribute, Quinlan proposed the information gain (choose the attribute with the largest information gain) and the gain ratio, both based on Shannon's entropy function. Breiman proposed the Gini index instead (choose the attribute with the smallest Gini index).
The information gain of an attribute is the entropy before partitioning minus the entropy after partitioning. Let p_i be the probability that an element of the data D belongs to class C_i (i = 1, ..., k). Then:
- C4.5 algorithm:
o The entropy before partitioning is computed as follows:
\[ Info(D) = -\sum_{i=1}^{k} p_i \log_2 p_i \]
o The entropy after attribute A is used to partition the data D into v parts is computed by the formula:
\[ Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, Info(D_j) \]
o The information gain obtained when attribute A is chosen to partition the data D into v parts is:
\[ Gain(A) = Info(D) - Info_A(D) \]
- CART algorithm: the partitioning attribute is chosen using the Gini index (the attribute with the smallest Gini index is selected).
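The impurity measures used by C4.5 and CART (Shannon entropy, information gain, and the Gini index) can be sketched in Python as follows. The function names and the list-of-labels representation are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the classes present in labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini(D) = 1 - sum(p_i ** 2) over the classes present in labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, partition):
    """Gain(A) = Info(D) - Info_A(D); partition is the list of label lists D_1..D_v."""
    n = len(labels)
    info_a = sum(len(part) / n * entropy(part) for part in partition)
    return entropy(labels) - info_a

# Hypothetical example: four documents, two classes, split perfectly by attribute A.
labels = ["a", "a", "b", "b"]
gain = information_gain(labels, [["a", "a"], ["b", "b"]])
```

A split that separates the classes perfectly removes all entropy, so its gain equals the entropy of the unsplit data.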
o The Gini index of the data D is computed as follows:
\[ Gini(D) = 1 - \sum_{i=1}^{k} p_i^2 \]
2.4.4 Support vector machine (SVM) method
The basic property that determines the classification ability of a classifier is its generalization performance, that is, its ability to classify new data based on the knowledge accumulated during training. A training algorithm is considered good if the classifier obtained after training has high generalization performance. Generalization performance depends on two parameters: the training error and the capacity of the learning machine. The training error is the misclassification rate on the training data set. The capacity of the learning machine is determined by the Vapnik-Chervonenkis dimension (VC dimension). The VC dimension is an important concept for a family of separating functions (that is, classifiers). It is defined as the maximum number of points that the family of functions can separate completely in the object space. A good classifier is one with the lowest capacity (that is, the simplest one) that still keeps the training error small. The SVM method is built on this idea.
Consider the simplest classification problem, two-class classification, with the sample data set:
\[ \{(x_i, y_i)\}, \quad i = 1, \dots, N, \quad y_i \in \{-1, +1\} \]
where the samples are the vectors of the objects to be classified, divided into positive and negative samples:
- Positive samples belong to the domain of interest and are labeled y_i = 1;
- Negative samples do not belong to the domain of interest and are labeled y_i = -1.
In this case, the SVM classifier is the hyperplane that separates the positive samples from the negative samples with the maximum margin, where the margin is the distance between the hyperplane and the nearest positive and negative samples. This hyperplane is called the optimal-margin hyperplane.
Figure 6: The hyperplane separating the positive samples from the negative samples
The hyperplanes in the object space have the equation w^T x + b = 0, where w is the weight vector and b is the bias. Changing w and b changes the orientation of the hyperplane and its distance from the origin. The SVM classifier is defined as follows:
\[ f(x) = \operatorname{sign}(w^T x + b) \]
If the training data set is linearly separable, the following constraints hold:
\[ w^T x_i + b \ge +1 \quad \text{if } y_i = +1 \qquad (2) \]
\[ w^T x_i + b \le -1 \quad \text{if } y_i = -1 \qquad (3) \]
The two hyperplanes with equations w^T x_i + b = ±1 are called the supporting hyperplanes (the dashed lines in Figure 6).
To construct an optimal-margin hyperplane, we must solve the following quadratic programming problem:
Maximize:
\[ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j x_i^T x_j \qquad (4) \]
Subject to:
\[ \alpha_i \ge 0 \quad (5) \qquad \sum_{i=1}^{N} \alpha_i y_i = 0 \quad (6) \]
where the Lagrange multipliers α_i, i = 1, 2, ..., N, are the variables to be optimized. The vector w is computed from the solution of this quadratic programming problem as follows:
\[ w = \sum_{i=1}^{N} \alpha_i y_i x_i \]
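As a small worked illustration of how w is recovered from the Lagrange multipliers, the sketch below uses a two-point toy data set whose dual solution (α_1 = α_2 = 0.25) was worked out by hand. The bias formula b = y_s - w·x_s for a support vector s is the standard hard-margin result and is an assumption here, since the source does not show it; all names are hypothetical.

```python
def weight_vector(alphas, labels, samples):
    """w = sum_i alpha_i * y_i * x_i."""
    dim = len(samples[0])
    w = [0.0] * dim
    for a, y, x in zip(alphas, labels, samples):
        for d in range(dim):
            w[d] += a * y * x[d]
    return w

def bias(w, alphas, labels, samples):
    """b = y_s - w.x_s for any support vector s (alpha_s > 0); assumed, not from the source."""
    for a, y, x in zip(alphas, labels, samples):
        if a > 0:
            return y - sum(wd * xd for wd, xd in zip(w, x))
    raise ValueError("no support vector found")

def decide(w, b, x):
    """f(x) = sign(w.x + b), returned as +1 or -1."""
    return 1 if sum(wd * xd for wd, xd in zip(w, x)) + b >= 0 else -1

# Toy data: one positive and one negative sample.
samples = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]  # maximizes 2a - 4a^2, the dual objective for this data
w = weight_vector(alphas, labels, samples)
b = bias(w, alphas, labels, samples)
```

Both samples lie exactly on the supporting hyperplanes here, so both are support vectors.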
If the training data set is not linearly separable, the problem can be handled in two ways.
The first approach uses a soft-margin hyperplane: some training samples are allowed to lie on the wrong side of the separating hyperplane, or on the correct side but inside the region between the separating hyperplane and the corresponding supporting hyperplane. In this case, the Lagrange multipliers of the quadratic programming problem get an additional positive upper bound C, a parameter chosen by the user. This parameter corresponds to the penalty for misclassified samples.
The second approach uses a nonlinear mapping φ to map the input data points into a new space of higher dimension. In this space the data points become linearly separable, or can be separated with fewer errors than in the original space. A linear decision surface in the new space corresponds to a nonlinear decision surface in the original space. The original quadratic programming problem then becomes:
Maximize:
\[ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j) \]
Subject to:
\[ 0 \le \alpha_i \le C \quad (10) \qquad \sum_{i=1}^{N} \alpha_i y_i = 0 \quad (11) \]
where k is a kernel function satisfying:
\[ k(x_i, x_j) = \varphi(x_i)^T \varphi(x_j) \qquad (12) \]
With a kernel function, we do not need to know the mapping φ explicitly. Moreover, by choosing a suitable kernel function, we can construct many different classifiers. For example, the polynomial kernel k(x_i, x_j) = (x_i^T x_j + 1)^p leads to a polynomial classifier, the Gaussian kernel k(x_i, x_j) = exp(-γ||x_i - x_j||^2) leads to an RBF (Radial Basis Function) classifier, and the sigmoid kernel k(x_i, x_j) = tanh(κ x_i^T x_j + δ), where tanh is the hyperbolic tangent, leads to a two-layer sigmoid neural network (one hidden layer and one output layer). An advantage of SVM training over other training methods, however, is that most of the parameters of the learning machine are determined automatically during training.
Training an SVM means solving the SVM quadratic programming problem. Numerical methods for this problem require storing a matrix whose size is the square of the number of training samples. In real problems this is not feasible, because the training set is usually very large (possibly tens of thousands of samples). One algorithm that addresses this issue is SMO. SMO solves the SVM quadratic programming problem without storing this matrix.
The three components of SMO:
o An analytic method for solving for the two Lagrange multipliers,
o A heuristic for choosing which multipliers to optimize,
o A method for computing b at each step.
Advantages:
o SVM can automatically adjust its parameters to optimize classification performance, even in feature spaces of high dimension.
o SVM achieves fairly high accuracy on text classification problems.
Disadvantages:
o To obtain good classification results, a suitable kernel function must be chosen.
o Training must be repeated for multi-class problems, because SVM only solves two-class problems.
o Training takes a long time.
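The three kernel functions named in this section (polynomial, Gaussian, sigmoid) can be written directly from their formulas. The sketch below is illustrative only; the default parameter values are arbitrary assumptions, not values used in the thesis.

```python
import math

def dot(x, z):
    """Inner product x^T z of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, p=2):
    """k(x, z) = (x^T z + 1)^p"""
    return (dot(x, z) + 1) ** p

def gaussian_kernel(x, z, gamma=0.5):
    """k(x, z) = exp(-gamma * ||x - z||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, z, kappa=1.0, delta=0.0):
    """k(x, z) = tanh(kappa * x^T z + delta)"""
    return math.tanh(kappa * dot(x, z) + delta)
```

Any of these can stand in for the inner product x_i^T x_j in the dual problem, which is what makes the mapping φ implicit.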
The texts are stored in separate files in the *.data format (Vietnamese text with diacritics, UTF-8).
3.1.3. Text preprocessing
a. Text normalization
Texts fetched automatically from news sites are all formatted as HTML. The main content therefore has to be extracted (text only) and the markup (XML) tags removed. Separator characters (spaces, line breaks, tabs, ...) also need to be cleaned up so that word segmentation is easier.
Spelling problems: the texts are collected from many different sources, so spelling errors and inconsistencies in tone-mark placement and word usage are unavoidable. For example, we often meet different tone-mark placements such as "tùy" and "tuỳ". Although the two forms have the same meaning, they are different strings, so the learner treats them as two separate words, which reduces classification performance. Such words must therefore be normalized to a single form.
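The normalization steps described above (removing tags, collapsing separator characters, and unifying tone-mark placement) might look like the following sketch. The regular expressions and the one-entry variant table are assumptions for illustration; the thesis does not list its actual rules, and a real table would cover every tone and vowel group.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")
SPACE_RE = re.compile(r"\s+")

def nfc(s):
    """Use one Unicode form so that equal-looking accented letters compare equal."""
    return unicodedata.normalize("NFC", s)

# Hypothetical variant table: maps one tone-mark placement to the other.
# Matching is exact and case-sensitive in this sketch.
TONE_VARIANTS = {nfc("tuỳ"): nfc("tùy")}

def normalize_text(raw):
    text = TAG_RE.sub(" ", raw)             # drop markup tags
    text = html.unescape(text)              # decode entities such as &amp;
    text = nfc(text)
    text = SPACE_RE.sub(" ", text).strip()  # collapse spaces, tabs, line breaks
    return " ".join(TONE_VARIANTS.get(w, w) for w in text.split(" "))
```

Normalizing to NFC first matters because the same accented letter can be encoded either as one code point or as a base letter plus combining marks.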
CHAPTER 3: RESEARCH CONTENT AND RESULTS
3.1 Building the classifier
3.1.1 Model of the classification steps
In this thesis I carry out text classification in the following steps: building the sample data set, preprocessing the text, selecting features, modeling the vector space, building the classifiers, and testing and evaluation. The order of these steps is shown in the following diagram:
[Diagram of the steps; recoverable labels: Build data set, Vector space modeling, Classification]
Table 1: Tone-mark normalization
b. Word segmentation
Word segmentation is a very important and indispensable task when processing text data. In text classification in particular, segmentation accuracy strongly affects the effectiveness of later learning and classification. Many authors have studied this area and obtained very promising results. One example is the vnTokenizer word segmentation tool [1] by Le Hong Phuong, Nguyen Thi Minh Huyen, Ho Tuong Vinh, and Azim Roussanaly. By combining several techniques, namely finite-state automata, regular expression parsing, and maximal matching, vnTokenizer reaches an accuracy of about 97%. In this thesis I use this tool for word segmentation during learning and classification.
c. Removing stop words
Stop words are words that carry no classification meaning in the text that contains them, such as conjunctions and relational words. Stop words are usually removed in two ways, "hard" and "soft".
- "Hard" removal: I collect a list of stop words and remove them from the texts during learning. Specifically, I collected 268 stop words (details in Appendix B).
- "Soft" removal: the way stop words are removed depends on the classification method used.
o SVM method: based on the TF_IDF value; a word whose value exceeds the prescribed threshold is removed.
o Statistical method: based on the probability that the word belongs to the topic; if this probability is lower than the prescribed threshold, the word is removed from the feature list of the topic.
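The "hard" and "soft" removal strategies described above can be sketched as two small filters. The function names, the example tokens, and the threshold values are hypothetical; the 268-word list of the thesis is in its Appendix B and is not reproduced here.

```python
def remove_stop_words(tokens, stop_words):
    """'Hard' removal: drop every token found in a fixed stop-word list."""
    return [t for t in tokens if t.lower() not in stop_words]

def remove_above_threshold(weights, threshold):
    """'Soft' removal, SVM branch: drop words whose TF_IDF weight exceeds the threshold."""
    return {w: v for w, v in weights.items() if v <= threshold}

def remove_below_threshold(probabilities, threshold):
    """'Soft' removal, statistical branch: drop words whose topic probability is too low."""
    return {w: p for w, p in probabilities.items() if p >= threshold}

# Hypothetical segmented tokens (multi-syllable words joined by underscores).
kept = remove_stop_words(["hoc_sinh", "va", "giao_vien"], {"va", "cua"})
```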
[Table body: tone-mark placement variants of the vowel groups "uy", "oe", and "oa"]
In the TF_IDF formula:
w_i: the value of the i-th word in the list of representing words
TF_i: the total number of occurrences of the i-th word in all documents of the training set
D: the total number of documents in the training set
DF_i: the number of documents in the training set that contain the i-th word
b. Statistical method
In this phase a set of features has to be compiled for each topic. The features of a topic are gathered from the documents that belong to it. For each feature of a topic, the probability of belonging to that topic can be computed, and the feature is removed if the value falls outside the prescribed threshold. However, experiments show that removing these words does not improve classification performance. In this thesis I therefore only remove the fixed stop words (those in the stop-word list) and do not use the word-probability method.
3.1.5 Vector space modeling
a. SVM method
In the SVM method, each document of the training set is modeled as a vector whose number of dimensions equals the number of words used to represent documents. However, a single document cannot contain all the words of the data set, so this would waste storage space. I therefore model the documents as sparse vectors, recording only the words that actually occur in a document. For example:
Vb_j = {t_1:val_1, t_2:val_2, t_3:val_3, ..., t_n:val_n}
where:
t_i, with i = {1, 2, 3, ..., n}, is the i-th word in the document
3.1.4 Feature selection
a. SVM method
For a machine to understand and classify text, the text must be represented in some way. Modeling text as vectors is a simple method that is widely used in this field. Representing text as vectors in which each dimension is a dictionary word requires building, in advance, a dictionary large enough to contain all Vietnamese words. This is hard to do, and using too many words (including stop words) causes great redundancy and strongly affects later learning and classification. To keep things simple and improve classification performance, I therefore use only the words that occur in the training set to represent texts. These words are compiled from the features of the texts, with the stop words (which carry no classification meaning) removed.
Besides the fixed stop words already removed, texts still contain many words with little or no classification meaning; these also need to be removed from the list of representing words. Removal is based on the TF_IDF value of each word: any word whose TF_IDF value falls outside the prescribed threshold is removed. The TF_IDF value is computed as follows:
\[ w_i = TF_i \times \log\frac{D}{DF_i} \]
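The TF_IDF weight, computed from the total term frequency TF_i, the number of documents D, and the document frequency DF_i, can be sketched as follows. The natural logarithm is an assumption of this sketch, since the base of the logarithm is not recoverable from the source; the function name is hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(documents):
    """documents: list of token lists. Returns {word: TF_i * log(D / DF_i)}."""
    total_tf = Counter()  # TF_i: occurrences of word i across all documents
    doc_freq = Counter()  # DF_i: number of documents containing word i
    for tokens in documents:
        total_tf.update(tokens)
        doc_freq.update(set(tokens))
    n_docs = len(documents)  # D
    # Natural log is assumed here; the base only rescales all weights.
    return {w: total_tf[w] * math.log(n_docs / doc_freq[w]) for w in total_tf}

# Hypothetical toy corpus, for illustration only.
weights = tf_idf_weights([["a", "b"], ["a", "c"], ["a", "c", "c"]])
```

A word that appears in every document gets weight 0, because log(D / DF_i) = log(1) = 0.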
val_i, with i = {1, 2, 3, ..., n}, is the value of the i-th word in the document
There are several ways to compute the value of a word in a document, as presented in the theoretical background. In this thesis I use binary (logical) weighting, because it is the simplest and meets the requirements. The values val_i in the formula above are therefore the constant 1, and the documents take the following form:
Vb_j = {t_1:1, t_2:1, t_3:1, ..., t_n:1}
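A sparse, binary-weighted document vector of this kind can be built as in the sketch below. The vocabulary index and the "label id:value" text line (the layout used by LIBSVM-style tools) are assumptions for illustration; the thesis does not state which file layout or SVM tool it uses.

```python
def to_sparse_vector(tokens, vocabulary_index):
    """Map a token list to {term_id: 1} for the known terms that occur in it."""
    return {vocabulary_index[t]: 1 for t in tokens if t in vocabulary_index}

def to_line(label, vector):
    """Serialize as 'label id:val id:val ...' with term ids in ascending order."""
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(vector.items()))
    return f"{label} {pairs}"

# Hypothetical vocabulary of representing words.
vocabulary_index = {"goal": 1, "team": 2, "law": 3}
vector = to_sparse_vector(["team", "goal", "goal", "unknown"], vocabulary_index)
line = to_line(1, vector)
```

Repeated words and words outside the vocabulary do not change the vector, which is exactly what binary weighting implies.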
and build the models on these sub data sets. Specifically, with 5 topics I build 5*(5-1)/2 = 10 classifiers and 10 corresponding training sets for them. When classifying a document, I run it through all of these classifiers, and the result is the topic that is predicted most often. If several topics are tied for the number of predictions, I choose the first topic in the group.
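The pairwise voting scheme described above (one classifier per pair of topics, majority vote, first topic wins a tie) can be sketched as follows. The representation of a pairwise classifier as a function keyed by a topic pair is a hypothetical choice made for the example.

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(document, topics, pairwise_classifiers):
    """pairwise_classifiers maps (topic_a, topic_b) to a function returning one of the two."""
    votes = Counter()
    for pair in combinations(topics, 2):  # n * (n - 1) / 2 pairs
        votes[pairwise_classifiers[pair](document)] += 1
    best = max(votes.values())
    # Tie-break: the first topic, in the order of `topics`, among those with the most votes.
    return next(t for t in topics if votes[t] == best)

topics = ["education", "law", "health", "sports", "computing"]
pairs = list(combinations(topics, 2))

# Dummy classifiers for illustration: always prefer "sports" when it is in the pair.
classifiers = {
    p: (lambda doc, p=p: "sports" if "sports" in p else p[0]) for p in pairs
}
winner = one_vs_one_predict("any document", topics, classifiers)
```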
b. Statistical method: this method does not require building separate classifiers for each topic. The classifier is simply the set of features compiled in the feature selection phase.
3.1.7. Testing and evaluation
In data mining in general, and in text classification in particular, there are several ways to evaluate the effectiveness of algorithms, such as:
- Using one data set as the training set and a different data set as the test set.
- Using the k-fold protocol: split the data set into k equal parts (folds) and repeat k times, each time using k-1 folds for learning and 1 fold for testing; then take the average of the k test results.
- Using the hold-out protocol: randomly take 2/3 of the data set for learning and the remaining 1/3 for testing; this process may be repeated k times and the results averaged.
In this thesis I use the hold-out method to evaluate the classification algorithms.
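The repeated hold-out protocol can be sketched as below. The `train_fn` and `predict_fn` callbacks, the fixed random seed, and the default of three repetitions are assumptions of the example, chosen to keep the sketch independent of any particular classifier.

```python
import random

def hold_out_accuracy(samples, train_fn, predict_fn, repeats=3, seed=0):
    """samples: list of (document, label). Returns mean accuracy over random 2/3-1/3 splits."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        cut = (2 * len(shuffled)) // 3  # 2/3 for learning, 1/3 for testing
        train, test = shuffled[:cut], shuffled[cut:]
        model = train_fn(train)
        correct = sum(1 for doc, label in test if predict_fn(model, doc) == label)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)

# Hypothetical data and a trivial "classifier" that needs no training.
samples = [(i, "even" if i % 2 == 0 else "odd") for i in range(30)]
accuracy = hold_out_accuracy(
    samples,
    train_fn=lambda train: None,
    predict_fn=lambda model, doc: "even" if doc % 2 == 0 else "odd",
)
```

Averaging over several random splits reduces the dependence of the reported accuracy on one lucky or unlucky partition.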
3.2 Building the text classification system
3.2.1 Requirements
The system is built to perform two basic tasks:
- First: manage the issues related to text classification.
For this task, the system must provide all the functions needed for text classification in the data preparation, training, classification, and evaluation phases. The basic functions to be built for these tasks are:
o Topic management: add, delete, and update topic information,
o Management of the documents of a topic: add, delete, update content, normalize, segment words, count occurrences,
o Document feature management: delete words, select document features, remove stop words, compute word weights,
o Topic feature management: delete words, compile features by topic, remove stop words, compute word weights,
o Management of representing words: compile the list of representing words, delete words, compute word weights, remove stop words,
o Management of stop words and stop-word removal options: add, delete, and update stop words, remove stop words automatically (within the training set),
o Training set management: build the training set, export training sets for the classifiers, evaluate, back up, restore,
o Classifier management: build classifiers manually and automatically, test the classifiers,
o Testing and evaluation: evaluate the classification performance of the learning algorithms, compare the algorithms.
- Second: an application that classifies text documents stored on the computer. Once the classifiers have been built, the system must classify documents, either one at a time or many at once. When a single document is classified, only the recognized topic needs to be shown on screen. When many documents are classified, they need to be distributed into different folders according to the topic of each document.
3.2.2 Analysis
c. Use case diagrams
- Overview use case
Actor name | Description
Manager | The manager has the right to use all functions of the system and to grant usage rights to users.
User | Users are the group of people who use the system; however, they may only use the most basic functions.
Figure 9: Use case diagram for the classification function
o Data management
Figure 10: Use case diagram for the data management function
o Document feature management
Figure 11: Use case diagram for the document feature management function
o Topic feature management
Figure 12: Use case diagram for the topic feature management function
o Stop-word management
Figure 13: Use case diagram for the stop-word management function
o Management of representing words
Figure 14: Use case diagram for the management of representing words
o Training set management
Figure 15: Use case diagram for the training set management function
o Classifier management
Figure 16: Use case diagram for the classifier management function
o System login
Figure 17: Use case diagram for the system login functions
d. Use case specification
Because this part is rather long, the details are presented in Appendix A.
3.2.3 Design
a. Architecture design
i. System architecture
[Architecture diagram; recoverable labels: Database, Model]
Figure 18: System architecture
Meaning of the components:
- Database contains the database of the system.
- Model contains the data and the logic that solve the problem the software addresses.
- Presenter handles presentation logic and interaction with the underlying data, and can interact with the View to change it during processing.
- View is responsible for presenting the data of the Model and consists of the forms and controls used.
ii. System requirements
> Hardware
Configuration | CPU | RAM
Minimum | Celeron 2.0 GHz | 512 MB
Recommended | Core 2 Duo 2.0 GHz | 2 GB
Screen resolution: 800x600
o Operating system: Windows 7
o .NET Framework 4.0 or later
[Functional diagram; recoverable labels: Classifier management, Topic features, Document features]
Figure 19: System function diagram
iv. System interface diagram
[Interface diagram; recoverable labels: Text classification system, Document management, Document features, Build classifiers, Classifier management, Automatic build, Evaluation, Management, Help]
Figure 20: System interface diagram
b. Interface design
[Screenshot; recoverable labels: LOGIN, Password]
Figure 21: Main program interface