Language Modelling
• NPath Lists and Lattices
• NGram Language Models
• Word Network Expansion
NPath Lists
[Figure: word lattice running from <S> to </S> through word nodes W1 to W7, labelled n]
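The idea of reading the n best paths out of a lattice can be sketched in Python. This is only an illustration: the toy graph, its arc scores and the function name n_best_paths are hypothetical and are not taken from the figure or from HTK. The search assumes every arc score is a log probability (zero or negative), so a best-first search returns complete paths in order of decreasing score.

```python
import heapq

def n_best_paths(arcs, start, end, n):
    """Return up to n best (score, path) pairs from start to end.

    arcs maps a node to a list of (next_node, log_score) pairs.
    Assumes an acyclic graph and log scores that are <= 0.
    """
    # Heap entries are (negated score, path), so the best partial path pops first.
    heap = [(0.0, [start])]
    results = []
    while heap and len(results) < n:
        neg_score, path = heapq.heappop(heap)
        node = path[-1]
        if node == end:
            results.append((-neg_score, path))
            continue
        for nxt, log_score in arcs.get(node, []):
            heapq.heappush(heap, (neg_score - log_score, path + [nxt]))
    return results

# Hypothetical toy lattice (topology and scores invented for illustration).
arcs = {
    "<S>": [("W1", -1.0), ("W2", -2.0)],
    "W1": [("W4", -1.0), ("W5", -2.5)],
    "W2": [("W3", -1.0)],
    "W3": [("W6", -1.0)],
    "W4": [("W6", -1.0)],
    "W5": [("</S>", -1.0)],
    "W6": [("</S>", -1.0)],
}

for score, path in n_best_paths(arcs, "<S>", "</S>", 3):
    print(score, " ".join(path))
```

Keeping whole paths on the heap is simple but memory-hungry; it is fine for a toy lattice, not for one with tens of thousands of arcs.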
HVite -n 4 -z lat -l "/lattice"
-C config_hvite
-H hmm30/macros -H hmm30/hmmdefs
-S dtnvn1106a.scp
-i rec_out_lattice.mlf
-w wdnet_bigram
-p 0.0 -s 5.0
dict.txt tiedlist
VERSION=1.0
UTTERANCE=Dtnvn2307/DTNVN1106A_10_S1388583.mfc
lmname=wdnet_bigram
lmscale=5.00 wdpenalty=0.00
acscale=1.00
vocab=dict.txt
N=8307 L=32095
I=0 t=0.00 W=!NULL
…
I=8306 t=3.19 W=!EXIT v=1
J=0 S=0 E=1 a=-159.04 l=0.000
J=1 S=0 E=2 a=-239.20 l=0.000
…
J=32094 S=8304 E=8306 a=-199.93 l=-1.860
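A lattice file of this shape can be read with a few lines of Python. The sketch below is a minimal reader, not HTK code: parse_fields, parse_lattice and arc_score are hypothetical helper names, the sample is a short excerpt built from lines shown above, and quoted values containing spaces are not handled. The combined arc score assumes the usual combination acscale * a + lmscale * l + wdpenalty.

```python
def parse_fields(line):
    """Split a line such as 'J=0 S=0 E=1 a=-159.04 l=0.000' into a dict."""
    return dict(field.split("=", 1) for field in line.split())

def parse_lattice(text):
    """Return (header, nodes, arcs) from lattice text in key=value form."""
    header, nodes, arcs = {}, {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = parse_fields(line)
        if "I" in fields:          # node definition
            nodes[int(fields["I"])] = fields
        elif "J" in fields:        # arc definition
            arcs.append(fields)
        else:                      # header field(s)
            header.update(fields)
    return header, nodes, arcs

def arc_score(arc, header):
    """Combine acoustic and LM log likelihoods (assumed combination rule)."""
    acscale = float(header.get("acscale", 1.0))
    lmscale = float(header.get("lmscale", 1.0))
    wdpenalty = float(header.get("wdpenalty", 0.0))
    return acscale * float(arc["a"]) + lmscale * float(arc["l"]) + wdpenalty

# Excerpt assembled from the lines on the slide (elided lines left out).
sample = """VERSION=1.0
lmscale=5.00 wdpenalty=0.00
acscale=1.00
I=0 t=0.00 W=!NULL
I=8306 t=3.19 W=!EXIT v=1
J=32094 S=8304 E=8306 a=-199.93 l=-1.860
"""

header, nodes, lat_arcs = parse_lattice(sample)
print(nodes[8306]["W"], arc_score(lat_arcs[0], header))
```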
Ngram Language Models
• Database preparation
• Mapping OOV words
• Language Model Generation
• Testing the LM perplexity
• Generating and using count-based models (dynamically adjusted)
• Model interpolation (LMerge)
• Class-based models
Ngram Language Models
Gram files
…
HE SEEMED TO BE : 5
HE SEEMED TO TAKE : 1
HE SEEMS A VERY : 1
HE SEEMS TO BE : 2
…
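The content of a gram file is simply a table of n-gram counts. A minimal Python sketch of that counting follows; it is not how LGPrep is implemented, count_ngrams is a hypothetical name, and the three sentences are invented for illustration.

```python
from collections import Counter

def count_ngrams(sentences, n):
    """Count every run of n consecutive words in each sentence."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Invented example sentences.
sentences = [
    "HE SEEMED TO BE HAPPY",
    "HE SEEMED TO BE LATE",
    "HE SEEMED TO TAKE IT",
]

gram_counts = count_ngrams(sentences, 4)
# Print in the same "words : count" layout as the gram file listing.
for ngram in sorted(gram_counts):
    print(" ".join(ngram), ":", gram_counts[ngram])
```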
Database preparation
Step 1: (new)
LNewMap -f WFC Holmes empty.wmap
Step 2: (count)
LGPrep -T 1 -a 100000 -b 200000
-d holmes.0 -n 4 -s "Sherlock Holmes"
empty.wmap -S listText.txt
Text1.txt
<s> QUOTE HOLMES QUOTE SAID I …</s>
<s> IT SEEMS RATHER SAD THAT …</s>
A text corpus of 10M word units is freely available to all researchers.
Database preparation
holmes.0: gram.0 gram.1 gram.2 wmap
Step 3: (sort+sequence)
LGCopy -T 1 -b 200000 -d holmes.1
holmes.0/wmap holmes.0/gram.*
holmes.1: data.0 data.1
Mapping OOV words
Step 4: LGCopy -T 1 -o -m lm_5k/5k.wmap
-b 200000 -d lm_5k
-w 5k.wlist
holmes.0/wmap -S listdata.txt
5k.wlist: the 5000 most common words
lm_5k/data.0
…
<s> IT IS !!UNK : 17
<s> IT LOOKS !!UNK : 2
<s> IT MUST !!UNK : 1
<s> IT SEEMED !!UNK : 1
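The effect of the mapping can be sketched in Python: every word outside the word list is replaced by !!UNK, and n-grams that become identical have their counts merged. This is an illustration only; map_oov, the tiny word list and the input counts are hypothetical, and LGCopy itself works on gram files and word maps rather than Python dictionaries.

```python
from collections import Counter

def map_oov(ngram_counts, word_list, unk="!!UNK"):
    """Replace out-of-vocabulary words by unk and merge the counts."""
    keep = set(word_list)
    mapped = Counter()
    for ngram, count in ngram_counts.items():
        key = tuple(w if w in keep else unk for w in ngram)
        mapped[key] += count
    return mapped

# Invented counts; ABSURD and ZEBRA are assumed to be outside the word list.
raw_counts = {
    ("<s>", "IT", "IS", "ABSURD"): 10,
    ("<s>", "IT", "IS", "ZEBRA"): 7,
    ("<s>", "IT", "IS", "TRUE"): 3,
}
word_list = ["<s>", "IT", "IS", "TRUE"]

mapped_counts = map_oov(raw_counts, word_list)
for ngram in sorted(mapped_counts):
    print(" ".join(ngram), ":", mapped_counts[ngram])
```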
Language Model Generation
Step 5: LFoF -T 1 -n 4 -f 32
lm_5k/5k.wmap lm_5k/5k.fof
-S listdata.txt lm_5k/data.0
Calculates the frequency-of-frequency (FoF) table
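A frequency-of-frequency table records how many distinct n-grams occur exactly once, exactly twice, and so on; discounting schemes such as Good-Turing rely on these numbers. The Python sketch below shows the computation on the four counts from the lm_5k/data.0 listing. The function name and the max_freq parameter are hypothetical; max_freq only plays a role similar to a limit on the number of table entries.

```python
from collections import Counter

def frequency_of_frequencies(ngram_counts, max_freq):
    """Return [n_1, ..., n_max_freq]; n_r = number of n-grams seen exactly r times."""
    fof = Counter(ngram_counts.values())
    return [fof.get(r, 0) for r in range(1, max_freq + 1)]

# The four 4-gram counts shown in the lm_5k/data.0 listing.
unk_counts = {
    ("<s>", "IT", "IS", "!!UNK"): 17,
    ("<s>", "IT", "LOOKS", "!!UNK"): 2,
    ("<s>", "IT", "MUST", "!!UNK"): 1,
    ("<s>", "IT", "SEEMED", "!!UNK"): 1,
}

fof_table = frequency_of_frequencies(unk_counts, 3)
print(fof_table)  # [2, 1, 0]
```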
Step 6: LBuild -T 1
-c 2 1 -c 3 1
-n 3
lm_5k/5k.wmap lm_5k/trigram_1
holmes.1/data.*
lm_5k/data.*
Testing the LM perplexity
Step 7: LPlex
-n 3 -t lm_5k/trigram_1 test/red-headed_league.txt
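Perplexity is the inverse geometric mean of the probabilities the model assigns to the test words. The Python sketch below shows only that definition; perplexity is a hypothetical helper, the probabilities are invented, and the treatment of OOV words and sentence markers that a real tool has to handle is left out.

```python
import math

def perplexity(word_probs):
    """Perplexity of a test text, given the model probability of each word."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))

# Invented probabilities: a model that gives every word probability 0.1
# is as uncertain as a uniform choice among 10 words.
print(perplexity([0.1, 0.1, 0.1]))
```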
Class-based models
Step 8: Cluster -T 1 -c 150 -i 1 -k
-o holmes.2/class lm_5k/5k.wmap
-S listdata.txt lm_5k/data.0
holmes.2/data.0
…
CLASS1 CLASS10 CLASS17 CLASS145 : 1
CLASS1 CLASS10 CLASS18 CLASS126 : 7
Step 9:
LGCopy -T 1 -d holmes.2 -m holmes.2/cmap
-w holmes.2/class.1.cm lm_5k/5k.wmap
-S listdata.txt lm_5k/data.0
Class-based models
Step 10:
LBuild -T 1 -c 2 1 -c 3 1 -n 3
holmes.2/cmap
lm_5k/trigram_2_cc
holmes.2/data.0
Step 11: Cluster -l holmes.2/class.1.cm -i 0
-q lm_5k/trigram_2_wc
lm_5k/5k.wmap -S listdata.txt lm_5k/data.0
(-i n: perform n iterations.)
Step 12: LLink lm_5k/trigram_2_cc
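A class-based model combines two components: the probability of a word given its class, and an n-gram over classes. For the bigram case this is P(w | w_prev) = P(w | class(w)) * P(class(w) | class(w_prev)). The Python sketch below shows that product with invented numbers; the class names, probabilities and the function class_bigram_prob are hypothetical and do not come from the Holmes data.

```python
def class_bigram_prob(prev_word, word, word_to_class,
                      p_word_given_class, p_class_bigram):
    """P(word | prev_word) under a class bigram model."""
    c_prev = word_to_class[prev_word]
    c = word_to_class[word]
    return p_word_given_class[word] * p_class_bigram[(c_prev, c)]

# Invented toy model with two classes.
word_to_class = {"HE": "CLASS1", "SHE": "CLASS1", "SEEMED": "CLASS2"}
p_word_given_class = {"HE": 0.6, "SHE": 0.4, "SEEMED": 0.5}
p_class_bigram = {("CLASS1", "CLASS2"): 0.2}

print(class_bigram_prob("HE", "SEEMED", word_to_class,
                        p_word_given_class, p_class_bigram))
```

Because HE and SHE share a class, both give SEEMED the same probability, which is how a class model shares statistics between words with few observations.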
Word Network Expansion
[Figure: two word networks. Uni-gram: nodes w1 to w6. Bi-gram: nodes labelled with word pairs SIL-W1, SIL-W2, W1-W4, W1-W5, W2-W3, W3-W6, W4-W6, W5-SIL, W6-SIL]
Word Network Expansion
[Figure: Tri-gram expansion. Starting from <S>, nodes SIL-W1 (W1), SIL-W2 (W2), W1-W5 (W5), W1-W4 (W4), W2-W3 (W3), W5-SIL (SIL), W4-W6 (W6), W3-W6 (W6), W6-SIL; arcs labelled with word triples SIL-W1-W5, SIL-W1-W4, SIL-W2-W3, W1-W5-SIL, W1-W4-W6, W2-W3-W6, W3-W6-SIL, W4-W6-SIL]
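The expansion can be sketched in Python: each node of the expanded network stands for a (previous word, word) pair, so an arc between two such nodes identifies a word triple and can carry a trigram probability. The successor table below is an assumption inferred from the node labels in the figures, and expand_network is a hypothetical helper, not part of HTK.

```python
def expand_network(successors, start="SIL", end="SIL"):
    """Expand a word network so that each node is a (previous, word) pair.

    successors maps a word to the words that may follow it.
    Nodes whose word is `end` are final and are not expanded further.
    """
    nodes, arcs = set(), set()
    stack = [(start, w) for w in successors.get(start, [])]
    while stack:
        node = stack.pop()
        if node in nodes:
            continue
        nodes.add(node)
        _, word = node
        if word == end:
            continue
        for nxt in successors.get(word, []):
            child = (word, nxt)
            arcs.add((node, child))
            stack.append(child)
    return nodes, arcs

# Word order assumed from the labels in the figures.
successors = {
    "SIL": ["W1", "W2"],
    "W1": ["W4", "W5"],
    "W2": ["W3"],
    "W3": ["W6"],
    "W4": ["W6"],
    "W5": ["SIL"],
    "W6": ["SIL"],
}

pair_nodes, pair_arcs = expand_network(successors)
bigram_labels = sorted("-".join(node) for node in pair_nodes)
# Each arc (a, b) -> (b, c) corresponds to the word triple a-b-c.
trigram_labels = sorted("-".join(src + dst[1:]) for src, dst in pair_arcs)
print(bigram_labels)
print(trigram_labels)
```

With this successor table the sketch yields nine pair nodes and eight triples, the same labels that appear in the bi-gram and tri-gram figures.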
HLRescore [options] vocabFile LatFiles
• lattice generation with HVite using a bigram
• lattice pruning with HLRescore (-t)
• expansion of lattices using a trigram (-n)
• finding 1-best transcription in the expanded lattice (-f)