Burrows – Wheeler Transform Its Properties And Applications

Trang 2

Basic Concepts in Data Compression

– We would like to design a compressor that, given a text as input, represents it using the smallest possible number of bits. From this representation we must be able to reconstruct the original text without any loss of information.

Historical motivations:

– Save storage space and/or bandwidth.

Trang 6

0-th order compressors

● Build a table: for each symbol, store its frequency

● Assign a codeword to each symbol so that:

   ● Decompression: codewords must be uniquely decodable

   ● Minimize the compressed size: the shortest codewords must be assigned to the most frequent symbols

● Replace each symbol with its codeword to obtain the compressed text C

char  freq  code
a     5/11  0
b     2/11  100
c     1/11  101
d     1/11  110
r     2/11  111

Trang 9

0-th order compressors

● Decompression is easy:

   ● Scan C from left to right

   ● Every time we identify a codeword, we emit the corresponding symbol
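The encoding and decoding steps can be sketched as follows. This is a minimal illustration, not a reference implementation: the code table is one reading of the slide's table (a=0, b=100, c=101, d=110, r=111), the input text "abracadabra" is assumed because it matches the frequencies 5/11, 2/11, 1/11, 1/11, 2/11, and the function names are hypothetical.

```python
# Assumed code table (one reading of the slide): a prefix-free code in which
# the most frequent symbol, a, gets the shortest codeword.
CODE = {"a": "0", "b": "100", "c": "101", "d": "110", "r": "111"}


def compress(text, code=CODE):
    """Replace each symbol with its codeword."""
    return "".join(code[ch] for ch in text)


def decompress(bits, code=CODE):
    """Scan the bits left to right; emit a symbol whenever a codeword is recognised."""
    decode = {codeword: symbol for symbol, codeword in code.items()}
    out = []
    current = ""
    for bit in bits:
        current += bit
        if current in decode:  # unambiguous because the code is prefix-free
            out.append(decode[current])
            current = ""
    if current:
        raise ValueError("input ends in the middle of a codeword")
    return "".join(out)


if __name__ == "__main__":
    c = compress("abracadabra")
    print(c, len(c), "bits")
    print(decompress(c))
```

With this table the 11 symbols take 5·1 + 6·3 = 23 bits, not counting the space needed to store the table itself.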

Trang 11

High order compressors

● We can achieve better compression if the codeword we assign to a symbol also depends on the k symbols preceding it (its context)

● Build a table for each context of length k in S

Trang 13

High order compressors

Example: k = 2, context = ab

char  freq  code
a     0/2   -
b     0/2   -
c     0/2   -

In abracadabra the context ab occurs twice and is followed by r both times, so a, b and c have frequency 0/2 in this table and need no codeword.
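The per-context tables can be built with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, the text "abracadabra" is assumed as the running example, and the empirical entropy is used here as a stand-in for "how well a k-th order compressor could do", ignoring the cost of storing the tables.

```python
from collections import Counter, defaultdict
from math import log2


def context_tables(s, k):
    """For each context of length k in s, count the symbols that follow it."""
    tables = defaultdict(Counter)
    for i in range(k, len(s)):
        tables[s[i - k:i]][s[i]] += 1
    return tables


def empirical_entropy(s, k):
    """Bits per symbol needed if each context uses its own ideal 0-th order code."""
    total_bits = 0.0
    for followers in context_tables(s, k).values():
        m = sum(followers.values())
        total_bits += sum(c * log2(m / c) for c in followers.values())
    return total_bits / len(s)


if __name__ == "__main__":
    s = "abracadabra"
    print(dict(context_tables(s, 2)["ab"]))  # followers of the context ab
    for k in range(3):
        print(k, round(empirical_entropy(s, k), 3))
```

On this short example the estimate drops as k grows, which is the effect the slide describes.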

Trang 16

The models are the problem!

● Compression improves because we predict the next symbol better.

● Problem:

   ● The larger k is, the smaller the compressed text C becomes

   ● but we have to store more tables:

      ● O(σ^(k+1) log σ) bits in the worst case, where σ is the alphabet size

● Since compressed size = |C| + size of the tables, this approach requires a lot of tuning to find the best value of k (i.e., the value of k that minimizes the compressed size)

● Instead, we would like a method that uses a 0-th order compressor and does not have to care about the length of the contexts
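To get a feeling for how quickly the tables grow, the bound can be evaluated for a few values of k. This is only a rough illustration under stated assumptions: it counts σ^k contexts with σ entries each and charges ⌈log2 σ⌉ bits per entry, which is one simple accounting consistent with the O(σ^(k+1) log σ) bound, and the function name is hypothetical.

```python
from math import ceil, log2


def worst_case_table_bits(sigma, k):
    """sigma**k contexts, sigma entries per context, ceil(log2(sigma)) bits per entry."""
    return sigma ** (k + 1) * ceil(log2(sigma))


if __name__ == "__main__":
    for k in range(4):
        print(k, worst_case_table_bits(256, k))  # byte alphabet
```

For a byte alphabet the bound grows by a factor of 256 each time k increases by one, which is why k cannot simply be made large.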

Trang 21

Rearranging the input

● Idea!

– Permute the input so that it is more compressible by a 0-th order compressor

● Easiest way: sort the symbols lexicographically

abracadabra# becomes #aaaaabbcdrr, which can be stored as (#,1)(a,5)(b,2)(c,1)(d,1)(r,2)

Trang 26

● What is the problem?

The transformation is not reversible!

There are 997,920 distinct strings with the same symbol counts as abracadabra#, and sorting maps all of them to the same string.
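The count can be checked directly: it is the number of distinct arrangements of 12 symbols in which a occurs 5 times, b and r twice each, and c, d and # once each, that is 12! / (5! · 2! · 2!). A small sketch (the function name is hypothetical):

```python
from collections import Counter
from math import factorial


def distinct_arrangements(s):
    """Number of distinct strings that use exactly the symbols of s (a multinomial coefficient)."""
    total = factorial(len(s))
    for count in Counter(s).values():
        total //= factorial(count)
    return total


if __name__ == "__main__":
    print(distinct_arrangements("abracadabra#"))
```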

Trang 27

● What do you think about this one?

ard#rcaaaabb

Trang 29

Burrows-Wheeler Transform

[Burrows-Wheeler, 1994]

We are given S = abracadabra#

Trang 36

Build the matrix of all cyclic rotations of S:

abracadabra#
bracadabra#a
racadabra#ab
acadabra#abr
cadabra#abra
adabra#abrac
dabra#abraca
abra#abracad
bra#abracada
ra#abracadab
a#abracadabr
#abracadabra

Sort the rows

Trang 37

Burrows-Wheeler Transform

[Burrows-Wheeler, 1994]

The sorted rows (F = first column, L = last column):

F              L
# abracadabr   a
a #abracadab   r
a bra#abraca   d
a bracadabra   #
a cadabra#ab   r
a dabra#abra   c
b ra#abracad   a
b racadabra#   a
c adabra#abr   a
d abra#abrac   a
r a#abracada   b
r acadabra#a   b

The last column, L = ard#rcaaaabb, is the Burrows-Wheeler Transform of S.
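The construction by sorted rotations translates almost literally into code. This is a naive sketch for illustration, not an efficient construction: it materialises all n rotations, and it assumes the end marker # is unique and compares smaller than every other symbol (true for lowercase letters in Python's default string ordering). The function names are hypothetical.

```python
def sorted_rotations(s):
    """All cyclic rotations of s, sorted lexicographically."""
    return sorted(s[i:] + s[:i] for i in range(len(s)))


def bwt(s):
    """Last column L of the sorted rotation matrix."""
    return "".join(row[-1] for row in sorted_rotations(s))


def first_column(s):
    """First column F of the sorted rotation matrix (the symbols of s in sorted order)."""
    return "".join(row[0] for row in sorted_rotations(s))


if __name__ == "__main__":
    print(first_column("abracadabra#"))
    print(bwt("abracadabra#"))
```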

Trang 44

Burrows-Wheeler Transform:

why it works [Burrows-Wheeler, 1994]

Trang 48

Reverse the BWT

Trang 49

LF-Mapping

To recover S from L, we need to map each symbol in L to its corresponding occurrence in F.

Trang 50

To recover S from L, we need to map each symbol in L to its corresponding occurrence in F (LF-Mapping). The part of the matrix between F and L is unknown.

Simple for symbols with just one occurrence! But what about symbols that occur several times?

Trang 60

To recover S from L, we need to map each symbol in L to its corresponding occurrence in F (LF-Mapping).

The first a in L maps to the first a in F, the second a in L to the second a in F, and so on.

Why? Take the rows that end with a and rotate them rightward, i.e., move the a to the front, to obtain their cyclic rotations. The rotated rows all start with a, and they keep the same relative order (<) because that order is decided by the symbols that follow the a.

Trang 64

We reconstruct S backward

Trang 74

r = LF[r]; i--;

}

Trang 75

Starting from L, the LF-mapping can be computed in linear time using a simple algorithm:

● Update the counter C[L[i]]++
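One way to fill in the surviving fragments is sketched below. It is a reconstruction under assumptions, not the slides' own code: the array C is taken to hold, for each symbol, the next free row of F that starts with that symbol (so that C[L[i]]++ advances it), the end marker # is assumed unique and smaller than every other symbol, and the function names are hypothetical.

```python
from collections import Counter


def lf_mapping(L):
    """LF[i] = row of F holding the same occurrence of the symbol L[i]."""
    counts = Counter(L)
    C = {}
    total = 0
    for ch in sorted(counts):   # C[ch] = first row of F that starts with ch
        C[ch] = total
        total += counts[ch]
    LF = []
    for ch in L:
        LF.append(C[ch])
        C[ch] += 1              # the counter update C[L[i]]++
    return LF


def inverse_bwt(L, end="#"):
    """Rebuild S backward, starting from row 0 (the rotation that begins with the end marker)."""
    LF = lf_mapping(L)
    n = len(L)
    S = [""] * n
    S[n - 1] = end
    r = 0
    i = n - 2
    while i >= 0:
        S[i] = L[r]   # L[r] is the symbol that precedes the current one in S
        r = LF[r]
        i -= 1
    return "".join(S)


if __name__ == "__main__":
    print(lf_mapping("ard#rcaaaabb"))
    print(inverse_bwt("ard#rcaaaabb"))
```

Apart from sorting the distinct symbols of the alphabet, both functions make a single pass over L.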

Trang 76

Compress the BWT

Trang 77

[Burrows-Wheeler, 1994]

B&W suggest compressing L in 3 steps:

● Move-To-Front

● Run-Length Encoding

● Statistical encoder (e.g., Huffman)

Trang 79

Move-To-Front

[Bentley-Sleator-Tarjan-Wei, 1986]

Transform L, which is locally homogeneous, into a new string that is globally homogeneous

List = a b c d r   (the symbols of the alphabet; initially they are sorted lexicographically)

L = a r d r c a a a a b b

C =

Trang 83

List = a b c d r

L = a r d r c a a a a b b

C = 0

The first symbol, a, is at position 0 of List: output 0 and move a to the front of List (here List does not change).

Trang 84

The next symbol, r, is at position 4 of List: output 4 and move r to the front of List.

List = r a b c d

L = a r d r c a a a a b b

C = 0 4

Trang 87

After processing all of L:

List = b a c r d

L = a r d r c a a a a b b

C = 0 4 4 1 4 3 0 0 0 4 0

For a symbol that has occurred before, the number emitted is the number of distinct symbols seen since its previous occurrence (for example, the 3 emitted for the second a).

If the string is locally homogeneous, C contains a lot of small numbers.
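The Move-To-Front step on this example can be reproduced with a short sketch. The function names are hypothetical, and the list-based implementation is chosen for clarity rather than speed (each lookup scans the list).

```python
def move_to_front(text, alphabet):
    """Encode each symbol as its current position in the list, then move it to the front."""
    symbols = sorted(alphabet)   # initially sorted lexicographically
    out = []
    for ch in text:
        idx = symbols.index(ch)
        out.append(idx)
        symbols.insert(0, symbols.pop(idx))
    return out


def inverse_move_to_front(codes, alphabet):
    """Decode by replaying the same list updates."""
    symbols = sorted(alphabet)
    out = []
    for idx in codes:
        ch = symbols.pop(idx)
        out.append(ch)
        symbols.insert(0, ch)
    return "".join(out)


if __name__ == "__main__":
    c = move_to_front("ardrcaaaabb", "abcdr")
    print(c)
    print(inverse_move_to_front(c, "abcdr"))
```

The decoder needs only C and the initial list, so the step is reversible without any extra information.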

Trang 91

Burrows-Wheeler Transform: why Move-to-Front

[Figure: an excerpt of the BWT of Shakespeare's Hamlet and its encoding MTF(L); the visible part of MTF(L) consists almost entirely of 0s and 1s]

Trang 93

After MTF

● Now we have a string of small numbers: lots of 0s, many 1s, …

● Skewed frequencies: the next steps are RLE + Huffman!

[Figure: symbol frequencies]
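The run-length step can be illustrated on the Move-To-Front output of the earlier example, C = 0 4 4 1 4 3 0 0 0 4 0. This sketch uses plain (value, run length) pairs; it is an assumption made for illustration and is not claimed to be the exact run-length scheme of any particular compressor. The function names are hypothetical.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse each maximal run of equal values into a (value, length) pair."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(runs):
    """Expand (value, length) pairs back into the original sequence."""
    return [value for value, length in runs for _ in range(length)]


if __name__ == "__main__":
    c = [0, 4, 4, 1, 4, 3, 0, 0, 0, 4, 0]
    print(run_length_encode(c))
```

On long texts the runs of 0s produced by Move-To-Front are much longer than in this toy example, which is where run-length encoding pays off before the statistical encoder is applied.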

Trang 98

How to build (efficiently) the BWT?

Trang 99

How to build the BWT?

We are given S = abracadabra# and must sort the rows, i.e., all cyclic rotations of S. Because # occurs only once, sorting the rows is the same as sorting the suffixes of S.

Trang 101

Suffix Array: SA stores the starting positions in S of the suffixes of S, sorted lexicographically

Trang 107

Suffix Array

We are given S = abracadabra# (positions are numbered from 1)

SA   F              L
12   # abracadabr   a
11   a #abracadab   r
 8   a bra#abraca   d
 1   a bracadabra   #
 4   a cadabra#ab   r
 6   a dabra#abra   c
 9   b ra#abracad   a
 2   b racadabra#   a
 5   c adabra#abr   a
 7   d abra#abrac   a
10   r a#abracada   b
 3   r acadabra#a   b

Given SA, we have L[i] = S[SA[i]-1]; when SA[i] = 1, L[i] is the last symbol of S, i.e., #.
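The relation between SA and L can be checked with a naive sketch. It sorts the suffixes directly, so it is meant only to illustrate the definition, not to be an efficient construction; it uses 0-based positions internally, assumes # is unique and smaller than every other symbol, and the function names are hypothetical.

```python
def suffix_array(s):
    """Starting positions (0-based) of the suffixes of s in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def bwt_from_suffix_array(s):
    """L[i] is the symbol just before the i-th smallest suffix.

    For the suffix starting at position 0, s[-1] wraps around to the last
    symbol of s, which is the end marker.
    """
    return "".join(s[i - 1] for i in suffix_array(s))


if __name__ == "__main__":
    s = "abracadabra#"
    print([i + 1 for i in suffix_array(s)])  # 1-based, as on the slide
    print(bwt_from_suffix_array(s))
```

Each comparison between two suffixes may inspect up to n symbols, which is where the quadratic factor of the naive approach comes from.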

Trang 109

How to construct SA?

Sorting the suffixes directly with a comparison sort is elegant but inefficient:

• Θ(n² log n) time in the worst case

• O(n) space

Trang 111

Suffix Array

In recent years, many clever algorithms have been proposed that compute the Suffix Array in linear time:

● Karkkainen-Sanders [J ACM, 2006]

● Ko-Aluru [J Discrete Algorithms, 2005]

and practical ones (not linear time):

● Manzini-Ferragina [Algorithmica, 2004]

● Maniscalco-Puglisi [ACM JEA, 2006]

Trang 112

The BWT was considered too slow!

In 1994 the paper was rejected, and the BWT is still a Technical Report.

Trang 114

A compressor based on BWT: bzip2

You find this in your Linux distribution

Trang 117

A real implementation: bzip2

It doesn't build the BWT of the whole text but uses a blocking approach: S is split into blocks of 100-900 KB, and each block is transformed separately.

This is done only for time performance; the compression it achieves is worse.

Trang 118

bzip2 vs gzip (English text, 5 MB)

            gzip       bzip2      bigbzip
comp size   2.0 MB     1.5 MB     1.3 MB
comp time   1.7 secs   2.2 secs   2.4 secs
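A comparison of this kind can be tried with Python's standard library, which ships the `bz2` and `zlib` modules (zlib implements DEFLATE, the method used by gzip). This is only a sketch: the sample data is artificial, the function name is hypothetical, and the sizes it reports say nothing about the figures in the table above.

```python
import bz2
import zlib


def compressed_sizes(data):
    """Compressed size in bytes with DEFLATE (as used by gzip) and with bzip2."""
    return {
        "deflate": len(zlib.compress(data, 9)),
        # compresslevel 1..9 selects the bzip2 block size, from 100k to 900k
        "bzip2": len(bz2.compress(data, compresslevel=9)),
    }


if __name__ == "__main__":
    sample = b"abracadabra# " * 2000   # artificial, highly repetitive input
    print(len(sample), compressed_sizes(sample))
```

Running it on a real English text of a few megabytes gives a fairer picture than this repetitive sample.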

Posted: 30/10/2015, 18:04
