Basic Concepts in Data Compression
– We would like to design a compressor that, given a text in input, represents it using the smallest possible number of bits. From this representation we must be able to reconstruct the original text without any loss of information.
● Historical motivations:
– Save storage space and/or bandwidth.
0-th order compressors
● Build a table: for each symbol, store its frequency
● Assign a codeword to each symbol so that:
– Decompression: codewords must be uniquely decodable
– Minimize the compressed size: the shortest codewords must be assigned to the most frequent symbols
● Replace each symbol with its codeword to obtain the compressed text C

char  freq  code
a     5/11  0
b     2/11  100
c     1/11  101
d     1/11  110
r     2/11  111
0-th order compressors
● Decompression is easy:
– Scan C from left to right
– Every time we identify a codeword, we emit the corresponding symbol.
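The two steps (replace each symbol with its codeword, then scan C from left to right) can be sketched in a few lines of Python. The codeword table below is an assumption: it is read off the garbled slide table, and the input text abracadabra is chosen only because it matches the frequencies 5/11, 2/11, 1/11, 1/11, 2/11.

```python
# Codeword table as read from the slide (an assumption: the extracted table is
# garbled). No codeword is a prefix of another, so decoding is unambiguous.
CODE = {"a": "0", "b": "100", "c": "101", "d": "110", "r": "111"}

def encode(text, code=CODE):
    """Replace each symbol with its codeword."""
    return "".join(code[ch] for ch in text)

def decode(bits, code=CODE):
    """Scan the bits left to right; emit a symbol whenever a codeword is recognized."""
    inverse = {cw: ch for ch, cw in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix-free code: the first match is the codeword
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

C = encode("abracadabra")
print(C, len(C))   # 23 bits for 11 symbols
print(decode(C))   # abracadabra
```

The most frequent symbol (a) costs one bit, all the others three, which is where the saving comes from.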
High order compressors
● We can achieve better compression if the codeword we assign to a symbol also depends on the k symbols preceding it (its context)
● Build a table for each context of length k in S
● Example with k = 2 and context = ab: in abracadabra the context ab occurs twice and is followed by r both times, so in the table of this context a, b and c have frequency 0/2 and need no codeword (-).
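A minimal sketch of the per-context tables, assuming the text is abracadabra and k = 2. The function name context_tables is illustrative, not from the slides.

```python
from collections import Counter, defaultdict

def context_tables(text, k):
    """For every context of length k, count the symbols that follow it."""
    tables = defaultdict(Counter)
    for i in range(k, len(text)):
        context = text[i - k:i]
        tables[context][text[i]] += 1
    return tables

tables = context_tables("abracadabra", 2)
print(dict(tables["ab"]))   # {'r': 2}: after "ab" only r ever appears
print(len(tables))          # number of distinct contexts of length 2
```

A context whose table contains a single symbol, like ab here, predicts the next symbol perfectly, so that symbol can be coded at almost no cost.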
The models are the problem!
● The compression improves because we better predict the next symbol.
● Problem:
– The larger k is, the smaller the compressed text C is,
– but we have to store more tables:
– O(σ^(k+1) log σ) bits in the worst case
● Since compressed size = |C| + size of the tables, this approach requires a lot of tuning in order to find the best value of k (i.e., the value of k that minimizes the compressed size)
● Instead, we would like to have a method that uses a 0-th order compressor without caring about the length of the contexts
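To get a feeling for how fast the table cost grows, the worst-case bound can be evaluated numerically. This is only a sketch: the constant hidden in the O() is taken to be 1, and σ = 256 (byte alphabet) is an assumption.

```python
import math

def table_bits_bound(sigma, k):
    """sigma**k contexts, sigma entries each, about log2(sigma) bits per entry."""
    return sigma ** (k + 1) * math.ceil(math.log2(sigma))

for k in range(4):
    print(k, table_bits_bound(256, k))
```

Each additional symbol of context multiplies the bound by σ, so beyond very small k the tables can outweigh the savings in |C|.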
Rearranging the input
● Idea!
– Permute the input so that it is more compressible by a 0-th order compressor
● Easiest way: sort the symbols lexicographically
– The sorted string is a sequence of runs: (#,1)(a,5)(b,2)(c,1)(d,1)(r,2)
– Which is the problem? The transformation is not reversible! There are 997,920 distinct strings with these symbol counts (12!/(5!·2!·2!)), and they all sort to the same output.
● What do you think about this one?
ard#rcaaaabb
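The count of 997,920 can be checked directly. The sketch below assumes the input is abracadabra# (12 symbols) and counts the distinct strings that share its symbol counts, i.e. all the strings that sorting would collapse onto the same output.

```python
from collections import Counter
from math import factorial

def distinct_arrangements(text):
    """Multinomial coefficient n! / (c1! * c2! * ...) over the symbol counts."""
    total = factorial(len(text))
    for count in Counter(text).values():
        total //= factorial(count)
    return total

S = "abracadabra#"
print("".join(sorted(S)))         # #aaaaabbcdrr
print(distinct_arrangements(S))   # 997920
```

The candidate ard#rcaaaabb is one of these 997,920 permutations too, so the question is whether that particular permutation can be inverted.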
Burrows-Wheeler Transform
[Burrows-Wheeler, 1994]
Let us be given S = abracadabra#

Write down all the cyclic rotations of S:
abracadabra#
bracadabra#a
racadabra#ab
acadabra#abr
cadabra#abra
adabra#abrac
dabra#abraca
abra#abracad
bra#abracada
ra#abracadab
a#abracadabr
#abracadabra

Sort the rows (F is the first column, L is the last column):
F            L
# abracadabr a
a #abracadab r
a bra#abraca d
a bracadabra #
a cadabra#ab r
a dabra#abra c
b ra#abracad a
b racadabra# a
c adabra#abr a
d abra#abrac a
r a#abracada b
r acadabra#a b

L = ard#rcaaaabb is the BWT of S.
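The construction by sorted rotations translates almost literally into Python. This is a naive sketch for small inputs only (it materializes all n rotations), and it assumes the end marker # compares smaller than every other symbol, which holds for ASCII letters.

```python
def bwt_naive(text):
    """Sort all cyclic rotations; return the first (F) and last (L) columns."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    first = "".join(row[0] for row in rotations)
    last = "".join(row[-1] for row in rotations)
    return first, last

F, L = bwt_naive("abracadabra#")
print(F)   # #aaaaabbcdrr
print(L)   # ard#rcaaaabb
```

F is just the sorted input, while L groups together symbols that share the same following context, which is what makes it locally homogeneous.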
Burrows-Wheeler Transform: why it works
[Burrows-Wheeler, 1994]
Reverse the BWT
LF-Mapping
To reobtain S from L we need to map any symbol in L to its corresponding occurrence in F.
● Simple for symbols with just one occurrence!
● For symbols with several occurrences: the first a in L maps to the first a in F, the second a in L to the second a in F, and so on.
● Why? Take the rows that end with a: they appear in sorted order. Rotate them rightward, i.e., move the a in front, to obtain their cyclic rotations. All of them now start with a, so they keep the same relative order (<) among the rows that start with a.
We reconstruct S backward
Start from the row r whose last symbol L[r] is the end marker #, with i = n = |S|:
while (i > 0) {
    S[i] = L[r];
    r = LF[r]; i--;
}
Starting from L, the LF-mapping can be computed in linear time using a simple algorithm:
● While scanning L, update the counter C[L[i]]++
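A minimal sketch of the inversion, under the assumption that S ends with a unique end marker # that is smaller than every other symbol. The names lf_mapping and inverse_bwt are illustrative. The LF-mapping is built with one counting pass and one scan of L (plus a sort of the alphabet, which is small).

```python
from collections import Counter

def lf_mapping(L):
    """LF[i] = row of F holding the same occurrence of the symbol L[i]."""
    counts = Counter(L)
    first, total = {}, 0
    for c in sorted(counts):          # first[c] = row where the run of c starts in F
        first[c] = total
        total += counts[c]
    seen = Counter()                  # how many times each symbol was met in L so far
    LF = []
    for c in L:
        LF.append(first[c] + seen[c]) # j-th c in L maps to the j-th c in F
        seen[c] += 1
    return LF

def inverse_bwt(L, end_marker="#"):
    """Rebuild S backward, starting from the row whose last symbol is the marker."""
    LF = lf_mapping(L)
    r = L.index(end_marker)
    out = []
    for _ in range(len(L)):
        out.append(L[r])
        r = LF[r]
    return "".join(reversed(out))

print(inverse_bwt("ard#rcaaaabb"))   # abracadabra#
```

Only L is needed as input: F is implicit in the symbol counts, which is why L alone can be stored or compressed.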
Compress the BWT
[Burrows-Wheeler, 1994]
B&W suggest compressing L in 3 steps:
● Move-To-Front
● Run Length Encoding
● Statistical Encoder (e.g., Huffman)
Move-To-Front
[Bentley-Sleator-Tarjan-Wei, 1986]
Transform L, which is locally homogeneous, into a new string that is globally homogeneous.
List = a b c d r (the symbols in the alphabet; initially, they are sorted lexicographically)
L = a r d r c a a a a b b
● For each symbol of L, output its current position in List, then move it in front of List.
● First symbol a: C = 0, move a in front of List.
● Second symbol r: C = 0 4, List = r a b c d.
● At the end: C = 0 4 4 1 4 3 0 0 0 4 0, List = b a c r d.
● Each value is the number of distinct symbols seen since the previous occurrence of the same symbol (e.g., the 3 emitted for the second a).
● If the string is locally homogeneous, C contains a lot of small numbers.
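Move-To-Front is short enough to sketch in full. The version below uses a plain Python list (so each step costs time proportional to the alphabet size) and also includes the inverse, to show that the step loses no information. The function names are illustrative.

```python
def move_to_front(text, alphabet):
    """Emit the position of each symbol in the list, then move it to the front."""
    lst = sorted(alphabet)
    codes = []
    for ch in text:
        pos = lst.index(ch)
        codes.append(pos)
        lst.insert(0, lst.pop(pos))
    return codes, lst

def move_to_front_decode(codes, alphabet):
    """Inverse transform: replay the same list updates driven by the positions."""
    lst = sorted(alphabet)
    out = []
    for pos in codes:
        ch = lst.pop(pos)
        out.append(ch)
        lst.insert(0, ch)
    return "".join(out)

codes, final_list = move_to_front("ardrcaaaabb", "abcdr")
print(codes)        # [0, 4, 4, 1, 4, 3, 0, 0, 0, 4, 0]
print(final_list)   # ['b', 'a', 'c', 'r', 'd']
print(move_to_front_decode(codes, "abcdr"))
```

Runs of equal symbols in L become runs of 0s in C, whatever the repeated symbol is.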
Burrows-Wheeler Transform: why Move-to-Front
[Figure: a portion of L for Shakespeare's Hamlet and its Move-to-Front encoding]
MTF(L) = + 0 + 1 0 0 + 1 0 0 0 0 0 0 0 + 1 0 0 0
After MTF
● Now we have a string with small numbers: lots of 0s, many 1s, …
● Skewed frequencies: next steps are RLE + Huffman!
[Figure: symbol frequencies]
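As an illustration of the RLE step, here is a generic run-length encoder applied to the MTF output of the running example. It is a textbook sketch, not the specific run-length scheme used by any real compressor.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse each run of equal values into a (value, run length) pair."""
    return [(value, len(list(group))) for value, group in groupby(values)]

mtf_output = [0, 4, 4, 1, 4, 3, 0, 0, 0, 4, 0]
print(run_length_encode(mtf_output))
```

On real texts the runs of 0s are much longer than in this toy example, which is what makes the step worthwhile before the statistical encoder.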
How to build (efficiently) the BWT?
Let us be given S = abracadabra#
Sort the rows
Suffix Array
Let us be given S = abracadabra#
SA stores the starting positions in S of the suffixes sorted lexicographically

SA  suffix        L
12  #             a
11  a#            r
 8  abra#         d
 1  abracadabra#  #
 4  acadabra#     r
 6  adabra#       c
 9  bra#          a
 2  bracadabra#   a
 5  cadabra#      a
 7  dabra#        a
10  ra#           b
 3  racadabra#    b

Given SA, we have L[i] = S[SA[i]-1] (when SA[i] = 1, L[i] is the last symbol of S, i.e., #)
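The relation L[i] = S[SA[i]-1] can be checked with a few lines of Python. The suffix array is built here by plain sorting of the suffixes, which is fine for a toy input but not for large texts. Indices are 0-based in the code and shifted by one only for printing, to match the 1-based positions of the slides.

```python
def suffix_array(text):
    """0-based starting positions of the suffixes, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt_from_sa(text, sa):
    """L[i] is the symbol preceding the i-th smallest suffix."""
    # For i == 0 the index -1 wraps around to the last symbol (the end marker).
    return "".join(text[i - 1] for i in sa)

S = "abracadabra#"
sa = suffix_array(S)
print([i + 1 for i in sa])   # [12, 11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
print(bwt_from_sa(S, sa))    # ard#rcaaaabb
```

Because # is unique and smallest, sorting suffixes gives the same row order as sorting the cyclic rotations, so no rotation matrix has to be stored.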
How to construct SA?
Sorting the suffixes directly with a comparison-based sort is elegant but inefficient:
• Θ(n^2 log n) time in the worst case (O(n log n) comparisons, each taking up to O(n) time)
• O(n) space
Suffix Array
In recent years, many clever algorithms have been proposed that compute the Suffix Array in linear time:
● Karkkainen-Sanders [J ACM, 2006]
● Ko-Aluru [J Discrete Algorithms, 2005]
and practical ones (not linear time):
● Manzini-Ferragina [Algorithmica, 2004]
● Maniscalco-Puglisi [ACM JEA, 2006]
The BWT was considered too slow! In 1994, the paper was rejected, and the BWT is still a Technical Report.
A compressor based on BWT: bzip2
You find this in your Linux distribution
A real implementation: bzip2
● It doesn't build the BWT of the whole text but uses a blocking approach: S is split into blocks of 100-900 KB, and each block is transformed and compressed on its own.
● This is done just for time performance; the compression it achieves is worse.
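The block size can be explored from Python's standard bz2 module, where compresslevel 1 to 9 selects blocks of 100 KB to 900 KB. The input below is a hypothetical, highly repetitive sample, so the sizes it prints say nothing about real texts.

```python
import bz2

# Hypothetical sample input (about 1.1 MB), used only to exercise the API.
data = b"abracadabra" * 100_000

small_blocks = bz2.compress(data, compresslevel=1)   # 100 KB blocks
large_blocks = bz2.compress(data, compresslevel=9)   # 900 KB blocks

print(len(data), len(small_blocks), len(large_blocks))
assert bz2.decompress(small_blocks) == data
assert bz2.decompress(large_blocks) == data
```

Larger blocks let the BWT see longer-range repetitions, at the price of more memory and time per block.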
bzip2 vs gzip
English text, 5 MB

           gzip      bzip2     bigbzip
comp size  2.0 MB    1.5 MB    1.3 MB
comp time  1.7 secs  2.2 secs  2.4 secs