
Asymptotic results in over and under representation of words in DNA



DEPARTMENT OF MATHEMATICS

NATIONAL UNIVERSITY OF SINGAPORE

2005


Firstly, I would like to express my sincere thanks to my advisor, Professor Chen, Louis H. Y., for his help and guidance in these two years. Professor Chen suggested the research topic on over- and under-representation of words in DNA sequences to me, which is interesting and inspiring. From the research work, I not only gained knowledge of the basics of Computational Biology, but also learned many important modern techniques in probability theory, such as the Chen-Stein method. Meanwhile, Professor Chen gave me precious advice on my research work and taught me how to think rigorously. He always encouraged me to open my mind and discover new methods independently. The academic training I acquired during these two years will greatly benefit my future research.

I would also like to thank Professor Shao Qiman for his inspiring suggestion, which led to Remark 2.4 and finally to the proof of Theorem 3.9; and Associate Professor Choi Kwok Pui, who helped me revise the first draft of my thesis and gave me many valuable suggestions.

My thanks also go to Mr. Chew Soon Huat David, who gave me guidance in conducting computer simulations; Mr. Lin Honghuang, for helping me generate DNA sequences and compute word counts when conducting the simulations; and Mr. Dong Bin, for giving me advice in revising this thesis.

Finally, I would like to thank Mr. Chew Soon Huat David again for providing this wonderful LaTeX template for the thesis.


3 Asymptotic Results of Words in DNA

3.1 Tail Probabilities of Extrema of Sums of m-dependent Variables
3.2 Asymptotic Normality of Markov Chains

4.1 DNA Sequences under M0
4.2 DNA Sequences under M1


Summary

Identifying over- and under-represented words is often useful in extracting information from DNA sequences. Since the criteria for defining these over- and under-represented words are somewhat ambiguous, we shall focus on the words of maximal and minimal occurrences, which will be definitely regarded as over- and under-represented words respectively. In this thesis, we study the tail probabilities of the extrema over a finite set of standard normal random variables by using techniques like Bonferroni's inequalities and Poisson approximation associated with the Chen-Stein method. We apply similar techniques together with the moderate deviations of m-dependent random variables, and then derive the asymptotic tail probabilities of extrema over a set of word occurrences under the M0 model. The statistical distribution of word counts is also studied. We show the asymptotic normality of word counts under both the M0 and M1 models. Finally, we use computer simulations to study the tail probabilities of the most frequently and most rarely occurring DNA words under both the M0 and M1 models. The asymptotic results under the M1 model are shown to be similar to those for the M0 model.


4.1 Normal Q-Q plot of the sums of ξ scores of all 64 3-tuple words in 20,000 simulated DNA sequences under the M0 model

4.2 Normal Q-Q plot of the maxima of ξ scores of all 64 3-tuple words in 20,000 simulated DNA sequences under the M0 model

4.3 Point-to-point plots of (a) F_max(x) versus G(x), where F_max(x) stands for the estimated probability P̂(M ≥ x) and G(x) = 64(1 − Φ(x)); (b) F_min(x) versus G(x), where F_min(x) stands for the estimated probability P̂(m ≤ x) with G(x) = 64Φ(x)

4.4 Normal Q-Q plot of the sums of ξ scores of all 64 3-tuple words in 20,000 simulated DNA sequences under the M1 model

4.5 Point-to-point plots of (a) F_max(x) versus G(x); (b) F_max(x) versus G(x); (c) F_min(x) versus G(x)


Chapter 1

Introduction

The analysis of rarity and abundance of words in DNA sequences has always been of interest in biological sequence analysis. A direct way to observe whether a DNA word occurs rarely (or frequently) in a genome is to analyze the number of its occurrences in a DNA sequence. For a DNA sequence A_1 A_2 · · · A_n with A_i ∈ A = {A, C, G, T}, we define the word count (e.g. Waterman (1995)) N_u for a k-tuple word u = u_1 u_2 · · · u_k as follows:

N_u = Σ_{i=1}^{n−k+1} I_u(i) = Σ_{i=1}^{n−k+1} I(A_i = u_1, A_{i+1} = u_2, · · · , A_{i+k−1} = u_k),

where I_u(i) is the indicator of the word u occurring at the starting position A_i.
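As a concrete illustration of this definition (a sketch of mine, not code from the thesis), N_u can be computed by sliding a window of length k along the sequence and counting every matching starting position, overlaps included:

```python
def word_count(seq: str, u: str) -> int:
    """Word count N_u: occurrences of u in seq, overlaps included."""
    k = len(u)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == u)

# Overlapping occurrences are counted separately:
print(word_count("AAAA", "AA"))       # 3
print(word_count("ACGTACGT", "ACG"))  # 2
```

Note that overlapping occurrences make the indicators I_u(i) dependent, which is why the variance of N_u involves overlap terms later in the thesis.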

To determine whether a DNA word u is rare or abundant in a DNA sequence, one needs to introduce a probability model first. Typical models, such as the stationary m-order Markov chains, have been widely considered in the literature (Reinert et al. (2000)). In this thesis, two models for DNA sequences will be considered. One is called the M0 model, for which all letters are independently and identically distributed; the other is called the M1 model, for which {A_1, A_2, · · · } forms a stationary Markov chain of order 1.
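Both models are easy to simulate. The sketch below is my own illustration; the transition matrix P is an arbitrary symmetric (hence doubly stochastic) choice, so its stationary distribution is uniform and the chain started uniformly is stationary:

```python
import random

random.seed(7)
ALPHABET = "ACGT"

def simulate_m0(n, probs):
    """M0: letters i.i.d. with marginal probabilities (p_A, p_C, p_G, p_T)."""
    return "".join(random.choices(ALPHABET, weights=probs, k=n))

def simulate_m1(n, init_probs, P):
    """M1: first-order Markov chain; P[a][b] = P(next = b | current = a).
    For stationarity, init_probs should be the stationary distribution of P."""
    seq = random.choices(ALPHABET, weights=init_probs, k=1)
    for _ in range(n - 1):
        cur = ALPHABET.index(seq[-1])
        seq.append(random.choices(ALPHABET, weights=P[cur], k=1)[0])
    return "".join(seq)

uniform = [0.25] * 4
P = [[0.4, 0.2, 0.2, 0.2],
     [0.2, 0.4, 0.2, 0.2],
     [0.2, 0.2, 0.4, 0.2],
     [0.2, 0.2, 0.2, 0.4]]
print(len(simulate_m0(100, uniform)), len(simulate_m1(100, uniform, P)))
```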

In order to analyze the word count N_u, naturally, we shall first study its possible statistical distribution for a given model of the underlying DNA sequence. We adopt a commonly used standardized score:

z_u = (N_u − EN_u) / √Var(N_u),    (1.1)

where EN_u and Var(N_u) are the mean and variance of N_u respectively (Leung et al. (1996)). The statistical distribution of the word count N_u has already been well studied in the literature. Waterman (1995) (Chapter 12) showed that the joint distribution of a finite set of z scores can be well approximated by a multivariate normal distribution under the M1 model. Several research works aim at identifying over- and under-represented words in DNA or palindromes. A word is called over- (or under-)represented if it is observed more (or less) frequently than expected under some specified probability model (Phillips et al. (1987)). Leung et al. (1996) identified over- and under-represented short DNA words by ranking their z_L scores (maximum likelihood plug-in z scores) in a specific genome. Chew et al. (2003) studied the over- and under-representation of the accumulative counts of all palindromes of certain length by identifying their upper and lower 5% z scores of a standard normal distribution. In these studies, the criteria used to identify the over- (or under-)representation were different. Indeed, for different purposes in biological studies, the criteria would be different in general. There is no single universal way to determine whether a given word is over- (or under-)represented. However, if we consider the extreme case, i.e. if we only take the words of maximal and minimal occurrences, these two words are surely the over-represented and the under-represented ones respectively (which is exactly what we will do in this thesis).
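One way to see the score (1.1) in action is to estimate EN_u and Var(N_u) by plain Monte Carlo rather than by closed-form moments. The sketch below is my own illustration under a uniform M0 model, not the thesis's method:

```python
import random, statistics

random.seed(1)

def word_count(seq, u):
    """Overlapping occurrences of u in seq."""
    k = len(u)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == u)

def z_score_mc(observed_seq, u, n_sims=2000):
    """z_u = (N_u - EN_u) / sd(N_u) as in (1.1), with EN_u and Var(N_u)
    estimated by simulating i.i.d. uniform sequences (an illustration,
    not the closed-form moments used in the thesis)."""
    n = len(observed_seq)
    counts = [word_count("".join(random.choices("ACGT", k=n)), u)
              for _ in range(n_sims)]
    mu, sd = statistics.mean(counts), statistics.stdev(counts)
    return (word_count(observed_seq, u) - mu) / sd

seq = "".join(random.choices("ACGT", k=500))
print(round(z_score_mc(seq, "ACG"), 2))
```

A large positive (negative) value of this score flags the word as a candidate over- (under-)represented word.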

In this thesis, we shall apply the ξ scores, which are essentially the same as the z score defined above in equation (1.1), and analyze the over- and under-representation of a finite set of DNA words as the sequence length goes to infinity, by investigating the behavior of the extrema over their ξ scores.

We shall study the asymptotic results of ξ scores. Generally, DNA sequences are long, and asymptotic results may be of relevance to the statistical analysis of the word counts. For this, we introduce the following notations.

Assuming that the DNA sequence is modelled by M0, we shall show (see Theorem 3.9) that, for a finite set of ξ scores {ξ_1, ξ_2, · · · , ξ_d}, we have

P(max_i ξ_i > x) / [d(1 − Φ(x))] → 1  and  P(min_i ξ_i ≤ −x) / [dΦ(−x)] → 1

as n → ∞ and x → ∞ with 1 ≤ x ≤ c√(ln n), provided that the covariance matrix of word counts is non-singular. Here, Φ and ϕ respectively denote the distribution function and the density function of a standard normal random variable. When assuming the DNA sequence is M1, we will prove the asymptotic normality of the joint distribution of ξ scores by applying a central limit theorem for random variables under a mixing condition (Billingsley (1995), Section 27). Unfortunately, under the M1 model, the convergence of the ratios P(max_i ξ_i > x) / [d(1 − Φ(x))] and P(min_i ξ_i ≤ −x) / [dΦ(−x)] to 1 for ξ scores remains unsolved.

This thesis is organized as follows. Chapter 2 shows how the distribution functions of extrema of a finite set of correlated standard normal random variables behave when these extrema tend to extremely large or small values. In Chapter 3, the asymptotic convergence of the tail probabilities of extrema is established for word counts under the M0 model; the chapter is also devoted to studying the asymptotic normality of word counts under the M0 and M1 models. Results of simulations are presented in Chapter 4, which support the asymptotic results given by Theorem 3.8 and show the possibility that similar results can be obtained under the M1 model.


Chapter 2

Extrema of Normal Random Variables

In this chapter, we would like to investigate the distributions of both the maximum and the minimum of a set of standard normal random variables. More precisely, we will try to find out the probabilities of the maximum being greater than c, and the probabilities of the minimum being less than c_0, for c, c_0 ∈ R. Our main theorem in this chapter shows that, when c is large enough and c_0 is small enough, the asymptotic tail distributions of both extrema follow certain expressions in terms of c and c_0 respectively. We will present two methods of proving this theorem, one using Bonferroni's inequalities and the other using Poisson approximation associated with the Chen-Stein method.

To facilitate the proof of Theorem 2.7, we need a few lemmas first. The first lemma was given by Barbour et al. (1992). To make this thesis self-contained, we shall provide its proof, which is essentially the same as that of Barbour et al. (1992). Throughout this section, we assume the correlation r of two random variables X and Y to be strictly bounded between −1 and 1, i.e. −1 < r = corr(X, Y) < 1.

Lemma 2.1. Let (X, Y) be jointly normally distributed with mean vector 0, unit variances and correlation r.

(i) If 0 ≤ r < 1, then for any nonnegative a and b,

If −1 < r ≤ 0, the inequalities are reversed.

(ii) If 0 ≤ r < 1, then for any nonpositive a and b,

If −1 < r ≤ 0, the inequalities are reversed.

Proof. For part (i),


If 0 ≤ r < 1, we get the lower bound immediately. Next, we want to prove that the function (1 − Φ(x))/ϕ(x) is decreasing. Let f(x) = (1 − Φ(x))/ϕ(x). Then f′(x) = −1 + x f(x), which is negative since x(1 − Φ(x)) < ϕ(x) for all x. This gives the upper bound. Due to equation (2.1), the lower and upper bounds are reversed when r < 0. Hence, the same argument can be used to derive the reversed inequalities for −1 < r ≤ 0.

For part (ii), since

P(X > a, Y > b) = P(−X < −a, −Y < −b) = P(X < −a, Y < −b) = P(X ≤ −a, Y ≤ −b),

the same argument works when we take a and b to be nonpositive. As a result, the inequalities are established when 0 ≤ r < 1 for nonpositive a and b. The same argument works when −1 < r ≤ 0, with the inequalities reversed.

Lemma 2.2. (i) If 0 ≤ r < 1, then for any nonnegative a,

If −1 < r ≤ 0, the inequalities are reversed.

(ii) If 0 ≤ r < 1, then for any nonpositive a,

If −1 < r ≤ 0, the inequalities are reversed.

Proof. This lemma is a direct result of Lemma 2.1, obtained by substituting a for b.

The above two lemmas give exact expressions for the lower and upper bounds of the probability P(X > a, Y > a). Next, we would like to find out the asymptotic behavior of P(X > a, Y > a) as a tends to infinity. The rate of convergence of P(X > a, Y > a) is also given in the following lemma.

Lemma 2.3. Let (X, Y) be jointly normally distributed with mean vector 0, unit variances and correlation −1 < r < 1. Then, as a → ∞,

P(X > a, Y > a) = o(1 − Φ(a))   (2.2)

and

P(X ≤ −a, Y ≤ −a) = o(Φ(−a)).   (2.3)


Proof. When a → ∞, 1 − Φ(√((1−r)/(1+r)) a) → 0. Immediately, by applying the squeeze theorem, Lemma 2.2 yields equations (2.2) and (2.3).

Remark 2.4. The upper and lower bounds obtained by Lemma 2.1 are very tight, and they will refine the error bounds in normal approximation problems. However, it is not necessary to use such tight bounds to prove Lemma 2.3, as can be seen as follows.

Since X and Y are jointly normal, X + Y is normal with E(X + Y) = 0 and Var(X + Y) = 2(1 + r). Hence (X + Y)/√(2(1 + r)) is standard normal. Therefore,

P(X > a, Y > a) ≤ P(X + Y > 2a) = 1 − Φ(λa),

where λ = 2/√(2(1 + r)) = √(2/(1 + r)). Obviously, we have λ > 1. It suffices to prove that, when λ > 1, 1 − Φ(λa) = o(1 − Φ(a)) as a → ∞. When a > 0, we have the standard bounds

aϕ(a)/(1 + a²) ≤ 1 − Φ(a) ≤ ϕ(a)/a.

When λ > 1,

(1 − Φ(λa))/(1 − Φ(a)) ≤ [ϕ(λa)/(λa)] / [aϕ(a)/(1 + a²)] → 0 as a → ∞,

since ϕ(λa)/ϕ(a) = exp(−(λ² − 1)a²/2) tends to zero faster than any power of a.

Since we only need to observe the asymptotic convergence of the ratio P(X > a, Y > a)/(1 − Φ(a)), tight bounds for the term P(X > a, Y > a) are unnecessary. Furthermore, Lemma 2.1 applies to standard normal random variables, while the method shown in the proof of the above lemma can also be applied to random variables which converge weakly to standard normal random variables. For example, if we have random variables X_n ⇒ N(0, 1) and Y_n ⇒ N(0, 1), we may get asymptotic results similar to Lemma 2.3. We will discuss this in later chapters.
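The standard normal tail bounds aϕ(a)/(1 + a²) ≤ 1 − Φ(a) ≤ ϕ(a)/a, which drive the comparison of 1 − Φ(λa) with 1 − Φ(a), are easy to check numerically. This sketch (my own illustration) evaluates the tail via the complementary error function:

```python
import math

def phi(a):
    """Standard normal density."""
    return math.exp(-a * a / 2) / math.sqrt(2 * math.pi)

def upper_tail(a):
    """1 - Phi(a), via the complementary error function."""
    return 0.5 * math.erfc(a / math.sqrt(2))

# Check a*phi(a)/(1+a^2) <= 1 - Phi(a) <= phi(a)/a for several a > 0.
for a in [1.0, 2.0, 4.0, 8.0]:
    lo, mid, hi = a * phi(a) / (1 + a * a), upper_tail(a), phi(a) / a
    assert lo <= mid <= hi
    print(f"a={a}: {lo:.3e} <= {mid:.3e} <= {hi:.3e}")
```

The two bounds pinch together as a grows, which is exactly why the tail ratio 1 − Φ(λa) / (1 − Φ(a)) is controlled by the ratio of densities.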

We notice that, to derive the asymptotic convergence of the ratio P(X > a, Y > a)/(1 − Φ(a)), the correlation of X and Y should be strictly bounded between −1 and 1. If we have a sequence of correlated random variables {Z_1, Z_2, · · · , Z_d}, it is not realistic to check the correlations of every two nonidentical random variables one by one. In Proposition 2.6 below, we will show that a non-singular covariance matrix of {Z_1, Z_2, · · · , Z_d} implies that the correlations of every two nonidentical random variables are not equal to either 1 or −1. To prove this, we recall a well-known fact below.

Theorem 2.5. Let X and Y be two random variables and r be their correlation. Then |r| = 1 if and only if there exist constants a, b such that Y = aX + b with probability 1.

Then, we give our proposition as follows.

Proposition 2.6. Let Z_1, Z_2, · · · , Z_d be random variables with mean 0 and variance 1. Let Σ and R be the covariance matrix and correlation matrix respectively. If Σ is non-singular, all non-diagonal entries of R are strictly bounded between −1 and 1, i.e. −1 < r_ij < 1 for i ≠ j, where r_ij is the (i, j)-entry of R.

Proof. If there exist Z_i and Z_j such that r_ij = corr(Z_i, Z_j) = 1, Theorem 2.5 implies that there exist constants a and b such that Z_j = aZ_i + b with probability 1. Together with the conditions EZ_i = EZ_j = 0 and Var(Z_i) = Var(Z_j) = 1, we have Z_i = Z_j. Then the i-th and j-th rows of Σ are identical, which contradicts the non-singularity of the matrix.

With the above Lemma 2.3 and Proposition 2.6, we shall introduce the main result of this chapter. This theorem presents the asymptotic tail distributions for both the maximum and the minimum over a sequence of normal random variables.

Theorem 2.7. Let (Z_1, · · · , Z_d) be a random vector with a multivariate normal distribution with mean vector 0 and a non-singular covariance matrix Σ. Assume further that the variance of each Z_i, 1 ≤ i ≤ d, is 1. Then

P(max_{1≤i≤d} Z_i > c) ∼ d(1 − Φ(c)) as c → +∞,

and

P(min_{1≤i≤d} Z_i ≤ c_0) ∼ dΦ(c_0) as c_0 → −∞.
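Theorem 2.7 can be checked by simulation. The sketch below uses the special case of equicorrelated normals Z_i = √ρ·W + √(1−ρ)·E_i (my own choice of a non-singular covariance structure, not one from the thesis) and compares the empirical P(max Z_i > c) with d(1 − Φ(c)):

```python
import math, random

random.seed(0)

def tail(c):
    """1 - Phi(c), the standard normal upper tail."""
    return 0.5 * math.erfc(c / math.sqrt(2))

# Equicorrelated standard normals: Z_i = sqrt(rho)*W + sqrt(1-rho)*E_i,
# which gives corr(Z_i, Z_j) = rho and a non-singular covariance matrix.
d, rho, c, n_sims = 8, 0.3, 2.5, 100_000
hits = 0
for _ in range(n_sims):
    w = random.gauss(0, 1)
    zmax = max(math.sqrt(rho) * w + math.sqrt(1 - rho) * random.gauss(0, 1)
               for _ in range(d))
    hits += zmax > c
ratio = (hits / n_sims) / (d * tail(c))
print(f"P(max Z_i > {c}) / (d * (1 - Phi({c}))) is roughly {ratio:.2f}")
```

For moderate c the ratio sits somewhat below 1 (the pairwise exceedance terms subtracted by Bonferroni are not yet negligible), and it approaches 1 as c grows.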


Proof. We shall give two proofs of this theorem.

(Proof I) Recall Bonferroni's inequalities: for any events A_1, A_2, · · · , A_d,

Σ_{i=1}^d P(A_i) − Σ_{1≤i<j≤d} P(A_i ∩ A_j) ≤ P(∪_{i=1}^d A_i) ≤ Σ_{i=1}^d P(A_i).

Take A_i = {Z_i > c}, where each Z_i is standard normal with zero mean. Lemma 2.3 gives that


Since (Z_1, Z_2, · · · , Z_d) has zero mean, (−Z_1, −Z_2, · · · , −Z_d) has the same distribution, so the statement for the minimum follows from that for the maximum. Therefore, Theorem 2.7 is obtained.

For a set of independent events, if the probabilities for these events to occur are very small, we call them rare events. Suppose there are n independent events, each with probability p_i of occurring, 1 ≤ i ≤ n, and each p_i tends to zero. Then, for k = 0, 1, · · · , the probability for exactly k of these events to occur is approximately equal to e^{−λ} λ^k / k!, where λ = Σ_i p_i. This is known as the Poisson limit theorem. It leads to an important fact: the probability for at least one event to occur is approximately equal to 1 − e^{−λ}.
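The Poisson limit just described is easy to check empirically. In this sketch (illustrative parameters of my own choosing), the simulated probability that at least one rare event occurs is compared with 1 − e^{−λ}:

```python
import math, random

random.seed(3)

# n independent rare events with small probabilities p_i.
n = 200
p = [random.uniform(0.001, 0.01) for _ in range(n)]
lam = sum(p)  # lambda = sum of the p_i

n_sims = 20_000
at_least_one = sum(
    any(random.random() < pi for pi in p) for _ in range(n_sims)
) / n_sims

print(f"simulated P(at least one) = {at_least_one:.3f}, "
      f"1 - exp(-lambda) = {1 - math.exp(-lam):.3f}")
```

The two numbers agree to within Monte Carlo error, since the exact value 1 − Π(1 − p_i) differs from 1 − e^{−λ} only by terms of order Σ p_i².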

Therefore, it is quite natural to think of using Poisson approximation here. In this section, we would like to provide another method to prove Theorem 2.7, which employs the technique of Poisson approximation associated with the Chen-Stein method. In 1975, Chen first applied Stein's method (Stein (1972)) to Poisson approximation problems, and obtained error bounds when approximating sums of dependent Bernoulli random variables with the Poisson distribution. The Chen-Stein method has been successfully developed in the past 30 years and has resulted in many interesting applications (see e.g. Barbour and Chen (2005a, b)).

In Poisson approximation problems, we use the total variation distance to show how well one random variable approximates the other. The total variation distance between the distributions of two random variables X and Y is defined as

||L(X) − L(Y)|| = sup_A |P(X ∈ A) − P(Y ∈ A)|.

Suppose {X_α : α ∈ J} are dependent Bernoulli random variables with index set J. Denote the probabilities of occurrence by p_α = P(X_α = 1), and write W = Σ_{α∈J} X_α and λ = EW = Σ_{α∈J} p_α.

Theorem 2.8. The total variation distance between the distribution of W, L(W), and the Poisson distribution with mean λ, Po(λ), satisfies

||L(W) − Po(λ)|| ≤ 2(b_1 + b_2 + b_3),   (2.9)

where b_1, b_2 and b_3 are given by equations (2.6)–(2.8).


With Theorem 2.8, we now introduce the second proof of Theorem 2.7.

Proof. (Proof II) Let the finite state space be J = {1, · · · , d}, and let the indicator of the event {Z_i > c} be X_i = I(Z_i > c). Suppose p_i = P(X_i = 1) and W = Σ_{i=1}^d X_i. Next, we shall apply Theorem 2.8. Take B_i, the neighborhood of X_i, to be the whole index set in the state space. Then b_3, given by equation (2.8), becomes 0, and


Therefore, equation (2.9) becomes

Similarly to the arguments for max_i Z_i, we define the indicators Y_i = I(Z_i ≤ c_0), q_i = P(Y_i = 1) and U = Σ_{i=1}^d Y_i. Then the event {min_i Z_i ≤ c_0} becomes {U ≥ 1}. Since Σ_i q_i tends to zero, it follows that

Remark 2.9. As one can see, the proof using Bonferroni's inequalities is more approachable and easier to understand. However, we still keep the second proof in this section, because it is an interesting application of the Poisson approximation associated with the Chen-Stein method.
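The idea behind the second proof can be illustrated numerically in the simplest, independent case, where W = Σ I(Z_i > c) is compared with its Poisson approximation P(W ≥ 1) ≈ 1 − e^{−λ}. The parameters below (d = 64, c = 2) are my own choices for illustration:

```python
import math, random

random.seed(5)

def tail(c):
    """1 - Phi(c), the standard normal upper tail."""
    return 0.5 * math.erfc(c / math.sqrt(2))

# Independent case: W = sum of indicators I(Z_i > c), lambda = d * (1 - Phi(c)),
# and P(max Z_i > c) = P(W >= 1) is compared with 1 - exp(-lambda).
d, c, n_sims = 64, 2.0, 50_000
lam = d * tail(c)
exceed = sum(
    any(random.gauss(0, 1) > c for _ in range(d)) for _ in range(n_sims)
) / n_sims
print(f"P(W >= 1) simulated: {exceed:.3f}, 1 - exp(-lambda): {1 - math.exp(-lam):.3f}")
```

Here the exact value is 1 − (1 − tail(c))^d, and the Poisson approximation is accurate because each indicator has a small success probability.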


Chapter 3

Asymptotic Results of Words in DNA

n → ∞ (so-called moderate deviations).

Theorem 3.1. Let X_1, X_2, · · · be a sequence of m-dependent random variables with EX_k = 0 and E|X_k|^p < ∞, p = 2 + c_0² for some c_0 > 0, and let q = min(p, 3). Take B_n² = ES_n², where S_n = X_1 + · · · + X_n, and let F_n denote the distribution function of S_n/B_n. Then, in the interval 1 ≤ x ≤ c√(ln B_n²), 0 < c ≤ c_0, we have

(1 − F_n(x)) / (1 − Φ(x)) = F_n(−x) / Φ(−x) = 1 + O(·).

With additional assumptions, we may get that the ratio (1 − F_n(x)) / (1 − Φ(x)) is asymptotically equal to 1, as shown in the following theorem.

Theorem 3.2. Let X_1, X_2, · · · be a sequence of m-dependent random variables with EX_k = 0 and E|X_k|^{2p} ≤ C_p < ∞, p = 2 + c_0² for some c_0 > 0. Then, for


Finally, we consider R_2. From the above discussion, we only need to explore the convergence of m^{q−1} x^{2p−q} L_{n,q}. When q = p, x^{2p−q} L_{n,q} = x^p L_{n,p} → 0, as has been proved earlier. When q = 3, using a similar argument, we have that


In the above theorems, the moderate deviations for m-dependent random variables are studied. We know that, when B_n² ≍ n is satisfied, the tail probability of S_n/B_n is approximated by the tail probability of a standard normal distribution when x and n tend to infinity with x ≤ c√(ln n).

To obtain the asymptotic results on over- and under-representation of DNA word counts, we shall study the distribution functions of word counts in DNA sequences and the normality of these word counts. Theorem 3.2 shall be applied. However, to begin with, it is crucial to investigate the properties of the means and covariance matrix for a set of word counts. Then we will easily obtain a set of centered and standardized scores of the word counts.

Now, consider a DNA sequence A = A_1 A_2 · · · A_n with independent and identically distributed letters from the state space A = {A, C, G, T}. The probabilities of picking the letters from A are (p_A, p_C, p_G, p_T). Let u = u_1 u_2 · · · u_k and v = v_1 v_2 · · · v_k be two k-tuple words with letters u_i, v_i ∈ A. The indicator that the word u occurs at the starting position i in the sequence A_1 · · · A_n is denoted by

I_u(i) = I(A_i A_{i+1} · · · A_{i+k−1} = u).

It is easy to derive that

E(I_u(i)) = p_{u_1} p_{u_2} · · · p_{u_k}.

Let π_u = p_{u_1} p_{u_2} · · · p_{u_k}. It is obvious that π_u does not depend on the exact positions of the letters in the word u.

In the following two theorems, we shall not only prove the existence of the limiting means and covariances of word counts, but also give their exact expressions. With these expressions, our computation of the score functions will be facilitated.

Theorem 3.3. For an i.i.d. DNA sequence A_1 A_2 · · · A_n and a word u, let N_u(n) be its word count. Then

To calculate the covariance of N_u and N_v, we must take account of the dependence between u and v when they overlap. Define an overlap bit β_{u,v}(j) = I(u_{j+1} = v_1, · · · , u_k = v_{k−j}). When 0 ≤ j − i < k, we have

E(I_u(i) I_v(j)) = p_{u_1} · · · p_{u_{j−i}} p_{u_{j−i+1}} · · · p_{u_k} β_{u,v}(j − i) p_{v_{k−j+i+1}} · · · p_{v_k} = π_u β_{u,v}(j − i) p_{v_{k−j+i+1}} · · · p_{v_k}.

When j − i ≥ k, u and v do not overlap. Then
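The overlap bit and the expectation formula above can be sketched as follows (my own illustration of the displayed M0 formula, here with uniform letter probabilities):

```python
# beta_{u,v}(j) = I(u_{j+1} = v_1, ..., u_k = v_{k-j}): a suffix of u of
# length k - j matches a prefix of v, so occurrences at distance j overlap.
def overlap_bit(u: str, v: str, j: int) -> int:
    k = len(u)
    return int(u[j:] == v[:k - j])

def expected_joint(u, v, j, p):
    """E[I_u(i) I_v(i+j)] under M0 for 0 <= j < k, following the displayed
    formula: pi_u * beta_{u,v}(j) * p_{v_{k-j+1}} ... p_{v_k}."""
    k = len(u)
    pi_u = 1.0
    for a in u:
        pi_u *= p[a]
    tail_prob = 1.0
    for a in v[k - j:]:  # last j letters of v, outside the overlap
        tail_prob *= p[a]
    return pi_u * overlap_bit(u, v, j) * tail_prob

p = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(overlap_bit("ACG", "CGT", 1))        # suffix "CG" matches prefix "CG": 1
print(expected_joint("ACG", "CGT", 1, p))  # (1/4)^3 * (1/4) = 1/256
```

When the overlap bit is 0, the joint expectation of overlapping positions vanishes, and for j − i ≥ k the indicators are independent, so E(I_u(i) I_v(j)) factors into π_u π_v.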
