DEPARTMENT OF MATHEMATICS
NATIONAL UNIVERSITY OF SINGAPORE
2005
Acknowledgements

Firstly, I would like to express my sincere thanks to my advisor, Professor Chen, Louis H. Y., for his help and guidance during these two years. Professor Chen suggested the research topic on over- and under-representation of words in DNA sequences to me, which is interesting and inspiring. From the research work, I not only gained knowledge of the basics of Computational Biology, but also learned many important modern techniques in probability theory, such as the Chen-Stein method. Meanwhile, Professor Chen gave me precious advice on my research work and taught me how to think rigorously. He always encouraged me to keep an open mind and discover new methods independently. The academic training I acquired during these two years will greatly benefit my future research.

I would also like to thank Professor Shao Qiman for his inspiring suggestion, which led to Remark 2.4 and finally to the proof of Theorem 3.9; and Associate Professor Choi Kwok Pui, who helped me revise the first draft of my thesis and gave me many valuable suggestions.
My thanks also go to Mr. Chew Soon Huat David, who gave me guidance in conducting computer simulations; Mr. Lin Honghuang, for helping me generate DNA sequences and compute word counts for the simulations; and Mr. Dong Bin, for giving me advice in revising this thesis.

Finally, I would like to thank Mr. Chew Soon Huat David again for providing this wonderful LaTeX template for the thesis.
Contents

3 Asymptotic Results of Words in DNA
3.1 Tail Probabilities of Extrema of Sums of m-dependent Variables
3.2 Asymptotic Normality of Markov Chains
4.1 DNA Sequences under M0
4.2 DNA Sequences under M1
Summary

Identifying over- and under-represented words is often useful in extracting information from DNA sequences. Since the criteria for defining over- and under-represented words are somewhat ambiguous, we shall focus on the words of maximal and minimal occurrences, which can definitely be regarded as over- and under-represented words respectively. In this thesis, we study the tail probabilities of the extrema over a finite set of standard normal random variables, using techniques such as Bonferroni's inequalities and Poisson approximation associated with the Chen-Stein method. We apply similar techniques together with moderate deviations of m-dependent random variables, and then derive the asymptotic tail probabilities of the extrema over a set of word occurrences under the M0 model. The statistical distribution of word counts is also studied; we show the asymptotic normality of word counts under both the M0 and M1 models. Finally, we use computer simulations to study the tail probabilities of the most frequently and most rarely occurring DNA words under both the M0 and M1 models. The asymptotic results under the M1 model are shown to be similar to those for the M0 model.
List of Figures

4.1 Normal Q-Q plot of the sums of ξ scores of all 64 3-tuple words in 20,000 simulated DNA sequences under the M0 model.

4.2 Normal Q-Q plot of the maxima of ξ scores of all 64 3-tuple words in 20,000 simulated DNA sequences under the M0 model.

4.3 Point-to-point plots of values of (a) $F_{\max}(x)$ versus $G(x)$, where $F_{\max}(x)$ stands for the estimated probability $\hat{P}(M \ge x)$ and $G(x) = 64\big(1 - \Phi(x)\big)$; (b) $F_{\min}(x)$ versus $G(x)$, where $F_{\min}(x)$ stands for the estimated probability $\hat{P}(m \le x)$ with $G(x) = 64\,\Phi(x)$.

4.4 Normal Q-Q plot of the sums of ξ scores of all 64 3-tuple words in 20,000 simulated DNA sequences under the M1 model.

4.5 Point-to-point plots of values of (a) $F_{\max}(x)$ versus $G(x)$; (b) $F_{\max}(x)$ versus $G(x)$; (c) $F_{\min}(x)$ versus $G(x)$.
Chapter 1

Introduction
The analysis of rarity and abundance of words in DNA sequences has always been of interest in biological sequence analysis. A direct way to observe whether a DNA word occurs rarely (or frequently) in a genome is to analyze the number of its occurrences in a DNA sequence. For a DNA sequence $A_1A_2\cdots A_n$ with $A_i \in \mathcal{A} = \{A, C, G, T\}$, we define the word count (e.g., Waterman (1995)) $N_u$ for a $k$-tuple word $u = u_1u_2\cdots u_k$ as follows:
$$N_u = \sum_{i=1}^{n-k+1} I(A_i = u_1,\, A_{i+1} = u_2,\, \ldots,\, A_{i+k-1} = u_k) = \sum_{i=1}^{n-k+1} I_u(i),$$
where $I_u(i)$ is the indicator of the word $u$ occurring at the starting position $i$.
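As a concrete illustration of this definition, here is a minimal Python sketch of the overlapping word count (the function name `word_count` is ours, not from the thesis):

```python
def word_count(seq: str, u: str) -> int:
    """N_u: the number of starting positions i with A_i ... A_{i+k-1} = u."""
    k = len(u)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == u)

# Overlapping occurrences are counted separately: "AAA" occurs twice in "AAAA".
assert word_count("AAAA", "AAA") == 2
```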
To determine whether a DNA word $u$ is rare or abundant in a DNA sequence, one first needs to introduce a probability model. Typical models, such as stationary $m$-th order Markov chains, have been widely considered in the literature (Reinert et al. (2000)). In this thesis, two models for DNA sequences will be considered. One is called the M0 model, for which all letters are independent and identically distributed; the other is called the M1 model, for which $\{A_1, A_2, \ldots\}$ forms a stationary Markov chain of order 1.
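To make the two models concrete, the following small Python sketch generates sequences under M0 and M1; the probability values are illustrative assumptions, not parameters from the thesis.

```python
import random

ALPHABET = "ACGT"

def simulate_m0(n: int, probs) -> str:
    """M0: letters drawn i.i.d. from (p_A, p_C, p_G, p_T)."""
    return "".join(random.choices(ALPHABET, weights=probs, k=n))

def simulate_m1(n: int, init_probs, trans) -> str:
    """M1: a first-order Markov chain; trans[a] is the transition
    distribution out of letter a (stationary if init_probs is its
    stationary distribution)."""
    seq = random.choices(ALPHABET, weights=init_probs, k=1)
    for _ in range(n - 1):
        seq.append(random.choices(ALPHABET, weights=trans[seq[-1]], k=1)[0])
    return "".join(seq)

uniform = [0.25] * 4                      # assumed letter probabilities
trans = {a: uniform for a in ALPHABET}    # a uniform chain, for illustration
print(simulate_m0(20, uniform), simulate_m1(20, uniform, trans))
```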
In order to analyze the word count $N_u$, we shall naturally first study its possible statistical distribution under a given model of the underlying DNA sequence. We
adopt a commonly used standardized score:
$$z_u = \frac{N_u - EN_u}{\sqrt{\mathrm{Var}(N_u)}}, \qquad (1.1)$$
where $EN_u$ and $\mathrm{Var}(N_u)$ are the mean and variance of $N_u$ respectively (Leung et al. (1996)). The statistical distribution of the word count $N_u$ has already been well studied in the literature. Waterman (1995) (Chapter 12) showed that the joint distribution of a finite set of $z$ scores can be well approximated by a multivariate normal distribution under the M1 model. Several research works aim at identifying over- and under-represented words in DNA or palindromes. A word is called over- (or under-)represented if it is observed more (or less) frequently than expected under some specified probability model (Phillips et al. (1987)). Leung et al. (1996) identified over- and under-represented short DNA words by ranking their $z_L$ scores (maximum likelihood plug-in $z$ scores) in a specific genome. Chew et al. (2003) studied the over- and under-representation of the accumulative counts of all palindromes of a certain length by identifying their upper and lower 5% $z$ scores of a standard normal distribution. In these studies, the criteria used to identify over- (or under-)representation differed, and indeed, for different purposes in biological studies, the criteria will in general differ. There is no single universal way to determine whether a given word is over- (or under-)represented. However, if we consider the extreme case, i.e., if we only take the words of maximal and minimal occurrences, these two words are surely the over-represented and the under-represented ones respectively (which is exactly what we will do in this thesis).
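As a rough illustration of this extremal approach, the sketch below estimates the moments of $N_u$ by Monte Carlo under an assumed uniform M0 model, forms the standardized scores of all 3-tuple words, and reports the words attaining the maximal and minimal scores. All names and parameter values here are ours.

```python
import itertools, random, statistics

def word_count(seq, u):
    k = len(u)
    return sum(seq[i:i + k] == u for i in range(len(seq) - k + 1))

def scores_all_words(seq, k=3, reps=500):
    """z_u = (N_u - mean)/sd for every k-tuple word, with the mean and sd
    of N_u estimated from `reps` simulated M0 sequences of the same length."""
    n = len(seq)
    words = ["".join(w) for w in itertools.product("ACGT", repeat=k)]
    sims = {u: [] for u in words}
    for _ in range(reps):
        s = "".join(random.choices("ACGT", k=n))
        for u in words:
            sims[u].append(word_count(s, u))
    return {u: (word_count(seq, u) - statistics.mean(c)) / statistics.stdev(c)
            for u, c in sims.items()}

scores = scores_all_words("".join(random.choices("ACGT", k=500)))
print(max(scores, key=scores.get), min(scores, key=scores.get))
```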
In this thesis, we shall apply the ξ scores, which are essentially the same as the $z$ score defined above in equation (1.1), and analyze the over- and under-representation of a finite set of DNA words as the sequence length goes to infinity, by investigating the behavior of the extrema over their ξ scores.
We shall study the asymptotic behavior of ξ scores. Generally, DNA sequences are long, and asymptotic results may be of relevance to the statistical analysis of the word counts. For this, we introduce the following notation.
Assuming that the DNA sequence is modelled by M0, we shall show (see Theorem 3.9) that if there exists a finite set of ξ scores $\{\xi_1, \xi_2, \ldots, \xi_d\}$, we have
$$\frac{P(\max_i \xi_i > x)}{d\big(1 - \Phi(x)\big)} \to 1 \quad\text{and}\quad \frac{P(\min_i \xi_i \le -x)}{d\,\Phi(-x)} \to 1$$
as $n \to \infty$ and $x \to \infty$ with $1 \le x \le c\sqrt{\ln n}$, provided that the covariance matrix of word counts is non-singular. Here, $\Phi$ and $\varphi$ respectively denote the distribution function and the density function of a standard normal random variable. When the DNA sequence is assumed to be M1, we will prove the asymptotic normality of the joint distribution of ξ scores by applying a central limit theorem for random variables under a mixing condition (Billingsley (1995), Section 27). Unfortunately, under the M1 model, the convergence of the ratios $P(\max_i \xi_i > x)/\big(d(1 - \Phi(x))\big)$ and $P(\min_i \xi_i \le -x)/\big(d\,\Phi(-x)\big)$ to 1 remains unsolved.
This thesis is organized as follows. Chapter 2 shows how the distribution functions of extrema of a finite set of correlated standard normal random variables behave when these extrema tend to extremely large or small values. In Chapter 3, the asymptotic convergence of the tail probabilities of extrema is established for word counts under the M0 model; the chapter is also devoted to studying the asymptotic normality of word counts under the M0 and M1 models. Results of simulations are presented in Chapter 4, which support the asymptotic results given by Theorem 3.8 and show the possibility that similar results hold under the M1 model.
Chapter 2

Extrema of Normal Random Variables
In this chapter, we would like to investigate the distributions of both the maximum and the minimum of a set of standard normal random variables. More precisely, we will try to find the probability of the maximum being greater than $c$, and the probability of the minimum being less than $c_0$, for $c, c_0 \in \mathbb{R}$. Our main theorem in this chapter shows that, when $c$ is large enough and $c_0$ is small enough, the asymptotic tail distributions of both extrema follow certain expressions in terms of $c$ and $c_0$ respectively. We will present two methods of proving this theorem, one using Bonferroni's inequalities and the other using Poisson approximation associated with the Chen-Stein method.
To facilitate the proof of Theorem 2.7, we first need a few lemmas. The first lemma was given by Barbour et al. (1992). To make this thesis self-contained, we shall provide its proof, which is essentially the same as that of Barbour et al. (1992). Throughout this section, we assume that the correlation $r$ of the two random variables $X$ and $Y$ is strictly bounded between $-1$ and $1$, i.e., $-1 < r = \mathrm{corr}(X, Y) < 1$.
Lemma 2.1. Let $(X, Y)$ be jointly normally distributed with mean vector $\mathbf{0}$ and covariance matrix $\begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}$.

(i) If $0 \le r < 1$, then for any nonnegative $a$ and $b$, $P(X > a, Y > b)$ admits explicit lower and upper bounds (given in Barbour et al. (1992)); if $-1 < r \le 0$, the inequalities are reversed.

(ii) If $0 \le r < 1$, then for any nonpositive $a$ and $b$, the corresponding bounds on $P(X \le a, Y \le b)$ hold; if $-1 < r \le 0$, the inequalities are reversed.
Proof. For part (i), if $0 \le r < 1$, we get the lower bound immediately. Next, we want to prove that the function $\frac{1 - \Phi(x)}{\varphi(x)}$ is decreasing. Let $f(x) = \frac{1 - \Phi(x)}{\varphi(x)}$. Then
$$f'(x) = -1 + x\,\frac{1 - \Phi(x)}{\varphi(x)} < 0,$$
since $1 - \Phi(x) < \varphi(x)/x$ for $x > 0$ (and $x f(x) \le 0 < 1$ for $x \le 0$), which gives the upper bound. Due to equation (2.1), the lower and upper bounds are reversed when $r < 0$. Hence, the same argument can be used to derive the reversed inequalities for $-1 < r \le 0$.
For part (ii), since
$$P(X > a, Y > b) = P(-X < -a, -Y < -b) = P(X < -a, Y < -b) = P(X \le -a, Y \le -b),$$
the same argument works when we take $a$ and $b$ to be nonpositive. As a result, the inequalities are established for $0 \le r < 1$ and nonpositive $a$ and $b$. The same argument works when $-1 < r \le 0$, with the inequalities reversed. □
Lemma 2.2. Let $(X, Y)$ be as in Lemma 2.1.

(i) If $0 \le r < 1$, then for any nonnegative $a$, the bounds of Lemma 2.1(i) hold with $b = a$; if $-1 < r \le 0$, the inequalities are reversed.

(ii) If $0 \le r < 1$, then for any nonpositive $a$, the bounds of Lemma 2.1(ii) hold with $b = a$; if $-1 < r \le 0$, the inequalities are reversed.

Proof. This lemma is the direct result of Lemma 2.1 upon substituting $a$ for $b$. □
The above two lemmas give exact expressions for the lower and upper bounds of the probability $P(X > a, Y > a)$. Next, we would like to find the asymptotic behavior of $P(X > a, Y > a)$ as $a$ tends to infinity. The rate of convergence of $P(X > a, Y > a)$ is also given in the following lemma.
Lemma 2.3. Let $(X, Y)$ be jointly normally distributed with mean vector $\mathbf{0}$ and covariance matrix $\begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}$, $-1 < r < 1$. Then, as $a \to \infty$,
$$P(X > a, Y > a) = o\big(1 - \Phi(a)\big) \qquad (2.2)$$
and
$$P(X \le -a, Y \le -a) = o\big(\Phi(-a)\big). \qquad (2.3)$$

Proof. When $a \to \infty$, $1 - \Phi\Big(\sqrt{\tfrac{1-r}{1+r}}\, a\Big) \to 0$. Immediately, by applying the squeeze theorem, Lemma 2.2 yields equations (2.2) and (2.3). □
Remark 2.4. The upper and lower bounds obtained in Lemma 2.1 are very tight and can be used to refine the error bounds in normal approximation problems. However, it is not necessary to use such tight bounds to prove Lemma 2.3, as can be seen as follows.
Since $X$ and $Y$ are jointly normal, $X + Y$ is normal with $E(X + Y) = 0$ and $\mathrm{Var}(X + Y) = 2(1 + r)$. Hence $\frac{X + Y}{\sqrt{2(1+r)}}$ is standard normal. Therefore,
$$P(X > a, Y > a) \le P(X + Y > 2a) = 1 - \Phi\left(\frac{2a}{\sqrt{2(1+r)}}\right) = 1 - \Phi(\lambda a),$$
where $\lambda = \sqrt{2/(1+r)}$. Obviously, we have $\lambda > 1$. It suffices to prove that, when $\lambda > 1$, $1 - \Phi(\lambda a) = o\big(1 - \Phi(a)\big)$ as $a \to \infty$. When $a > 0$, the standard normal tail bounds give
$$\frac{a\,\varphi(a)}{1 + a^2} \le 1 - \Phi(a) \le \frac{\varphi(a)}{a}.$$
When $\lambda > 1$,
$$\frac{1 - \Phi(\lambda a)}{1 - \Phi(a)} \le \frac{\varphi(\lambda a)/(\lambda a)}{a\,\varphi(a)/(1 + a^2)} = \frac{1 + a^2}{\lambda a^2}\, e^{-(\lambda^2 - 1)a^2/2} \to 0 \quad \text{as } a \to \infty.$$
Since we only need to observe the asymptotic convergence of the ratio $P(X > a, Y > a)/\big(1 - \Phi(a)\big)$, tight bounds for the term $P(X > a, Y > a)$ are unnecessary. Furthermore, Lemma 2.1 applies to standard normal random variables, while the method shown in the proof of the above lemma can also be applied to random variables which converge weakly to standard normal random variables. For example, if we have random variables $X_n \Rightarrow N(0,1)$ and $Y_n \Rightarrow N(0,1)$, we may obtain asymptotic results similar to Lemma 2.3. We will discuss this in later chapters.
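A brief numerical illustration of Lemma 2.3, sketched with SciPy under an assumed correlation: the ratio $P(X > a, Y > a)/\big(1 - \Phi(a)\big)$ shrinks as $a$ grows.

```python
from scipy.stats import multivariate_normal, norm

r = 0.5   # an assumed correlation, strictly between -1 and 1
mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, r], [r, 1.0]])
for a in (1.0, 2.0, 3.0, 4.0):
    # By symmetry, P(X > a, Y > a) = P(X <= -a, Y <= -a) = F(-a, -a).
    joint = mvn.cdf([-a, -a])
    print(a, joint / norm.sf(a))   # the ratio tends to 0 as a -> infinity
```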
We notice that, to derive the asymptotic convergence of the ratio $P(X > a, Y > a)/\big(1 - \Phi(a)\big)$, the correlation of $X$ and $Y$ should be strictly bounded between $-1$ and $1$. If we have a sequence of correlated random variables $\{Z_1, Z_2, \ldots, Z_d\}$, it is not practical to check the correlations of every pair of distinct random variables one by one. In Proposition 2.6 below, we will show that a non-singular covariance matrix of $\{Z_1, Z_2, \ldots, Z_d\}$ implies that the correlation of any two distinct random variables equals neither $1$ nor $-1$. To prove this, we recall a well-known fact.
Theorem 2.5. Let $X$ and $Y$ be two random variables and let $r$ be their correlation. Then $|r| = 1$ if and only if there exist constants $a, b$ such that $Y = aX + b$ with probability 1.
We then state our proposition as follows.
Proposition 2.6. Let $Z_1, Z_2, \ldots, Z_d$ be random variables with mean 0 and variance 1. Let $\Sigma$ and $R$ be the covariance matrix and the correlation matrix respectively. If $\Sigma$ is non-singular, all non-diagonal entries of $R$ are strictly bounded between $-1$ and $1$, i.e., $-1 < r_{ij} < 1$ for $i \ne j$, where $r_{ij}$ is the $(i,j)$-entry of $R$.

Proof. If there exist $Z_i$ and $Z_j$ such that $r_{ij} = \mathrm{corr}(Z_i, Z_j) = 1$, Theorem 2.5 implies that there exist constants $a$ and $b$ such that $Z_j = aZ_i + b$ with probability 1. Together with the conditions $EZ_i = EZ_j = 0$ and $\mathrm{Var}(Z_i) = \mathrm{Var}(Z_j) = 1$, we have $Z_i = Z_j$ with probability 1. But then the $i$-th and $j$-th rows of $\Sigma$ coincide, so $\Sigma$ is a singular matrix, contradicting the assumption. The case $r_{ij} = -1$ is handled in the same way, with $Z_i = -Z_j$. □
With Lemma 2.3 and Proposition 2.6 above, we can now introduce the main result of this chapter. This theorem presents the asymptotic tail distributions of both the maximum and the minimum over a sequence of normal random variables.
Theorem 2.7. Let $(Z_1, \ldots, Z_d)$ be a random vector with a multivariate normal distribution with mean vector $\mathbf{0}$ and a non-singular covariance matrix $\Sigma$. Assume further that the variance of each $Z_i$, $1 \le i \le d$, is 1. Then
$$P\Big(\max_{1\le i\le d} Z_i > c\Big) \sim d\big(1 - \Phi(c)\big) \quad \text{as } c \to +\infty,$$
and
$$P\Big(\min_{1\le i\le d} Z_i \le c_0\Big) \sim d\,\Phi(c_0) \quad \text{as } c_0 \to -\infty.$$
Proof. We shall give two proofs of this theorem.

(Proof I) Recall Bonferroni's inequalities: for any events $A_1, A_2, \ldots, A_d$,
$$\sum_{i=1}^d P(A_i) - \sum_{1\le i<j\le d} P(A_i \cap A_j) \le P\Big(\bigcup_{i=1}^d A_i\Big) \le \sum_{i=1}^d P(A_i).$$
Take $A_i = \{Z_i > c\}$. Each $Z_i$ is standard normal with zero mean, and by Proposition 2.6 the correlation of any two distinct $Z_i, Z_j$ is strictly bounded between $-1$ and $1$, so Lemma 2.3 gives that
$$P(Z_i > c, Z_j > c) = o\big(1 - \Phi(c)\big) \quad \text{as } c \to \infty, \text{ for all } i \ne j.$$
Hence
$$d\big(1 - \Phi(c)\big) - o\big(1 - \Phi(c)\big) \le P\Big(\max_{1\le i\le d} Z_i > c\Big) \le d\big(1 - \Phi(c)\big),$$
which proves the first relation. Since $(Z_1, Z_2, \ldots, Z_d)$ has zero mean, $(-Z_1, -Z_2, \ldots, -Z_d)$ has the same distribution, and therefore
$$P\Big(\min_{1\le i\le d} Z_i \le c_0\Big) = P\Big(\max_{1\le i\le d} (-Z_i) \ge -c_0\Big) \sim d\big(1 - \Phi(-c_0)\big) = d\,\Phi(c_0) \quad \text{as } c_0 \to -\infty.$$
Therefore, Theorem 2.7 is obtained.
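As a quick numerical sanity check of Theorem 2.7 (a sketch under an assumed equicorrelated covariance matrix, using NumPy and SciPy), one can sample multivariate normal vectors and compare the empirical tails of the extrema with $d\big(1 - \Phi(c)\big)$ and $d\,\Phi(c_0)$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
d, r, c = 5, 0.3, 3.0                            # dimension, correlation, threshold
cov = np.full((d, d), r) + (1 - r) * np.eye(d)   # non-singular for -1/(d-1) < r < 1
Z = rng.multivariate_normal(np.zeros(d), cov, size=1_000_000)

print((Z.max(axis=1) > c).mean(), d * norm.sf(c))      # P(max Z_i > c) vs d(1 - Phi(c))
print((Z.min(axis=1) <= -c).mean(), d * norm.cdf(-c))  # P(min Z_i <= -c) vs d Phi(-c)
```

The agreement improves as the threshold $c$ grows, exactly as the theorem predicts.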
For a set of independent events, if the probabilities of these events occurring are very small, we call them rare events. Suppose there are $n$ independent events, each with probability $p_i$ of occurring, $1 \le i \le n$, where each $p_i$ tends to zero. Then, for $k = 0, 1, \ldots$, the probability that exactly $k$ of these events occur is approximately equal to $e^{-\lambda}\lambda^k/k!$, where $\lambda = \sum_i p_i$. This is known as the Poisson limit theorem. It leads to an important fact: the probability that at least one event occurs is approximately equal to $1 - e^{-\lambda}$, as the small sketch below illustrates.
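This sketch uses assumed, equal event probabilities and NumPy's binomial sampler:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, trials = 1000, 0.002, 200_000        # many independent rare events
lam = n * p
counts = rng.binomial(n, p, size=trials)   # number of events occurring per trial
print((counts >= 1).mean(), 1 - np.exp(-lam))   # both close to 0.865
```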
Therefore, it is quite natural to think of using Poisson approximation here. In this section, we provide another method to prove Theorem 2.7, which employs the technique of Poisson approximation associated with the Chen-Stein method. In 1975, Chen first applied Stein's method (Stein (1972)) to Poisson approximation problems, and obtained error bounds when approximating sums of dependent Bernoulli random variables by the Poisson distribution. The Chen-Stein method has been successfully developed over the past 30 years and has resulted in many interesting applications (see, e.g., Barbour and Chen (2005a, b)).
In Poisson approximation problems, we use the total variation distance to quantify how well one random variable approximates another. The total variation distance between two distributions is defined as
$$\|\mathcal{L}(X) - \mathcal{L}(Y)\| = \sup_A \big|P(X \in A) - P(Y \in A)\big|,$$
where $X$ and $Y$ are random variables.
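For discrete distributions the supremum is attained, and the distance reduces to half the $\ell_1$ difference of the probability mass functions; the sketch below (with assumed parameters) computes it for a binomial law and its Poisson approximation.

```python
from scipy.stats import binom, poisson

n, p = 100, 0.02
lam = n * p
# For discrete laws, sup_A |P(X in A) - P(Y in A)| = (1/2) sum_k |p_k - q_k|.
tv = 0.5 * sum(abs(binom.pmf(k, n, p) - poisson.pmf(k, lam)) for k in range(n + 1))
print(tv)   # small: Bin(100, 0.02) is already close to Po(2)
```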
Suppose $\{X_\alpha : \alpha \in J\}$ are dependent Bernoulli random variables with index set $J$. Denote the probabilities of occurring as $p_\alpha = P(X_\alpha = 1)$, and set $W = \sum_{\alpha\in J} X_\alpha$ and $\lambda = EW = \sum_{\alpha\in J} p_\alpha$. For each $\alpha \in J$, choose a neighborhood $B_\alpha \subset J$ with $\alpha \in B_\alpha$, and define
$$b_1 = \sum_{\alpha\in J}\sum_{\beta\in B_\alpha} p_\alpha p_\beta, \qquad (2.6)$$
$$b_2 = \sum_{\alpha\in J}\sum_{\beta\in B_\alpha,\,\beta\ne\alpha} E(X_\alpha X_\beta), \qquad (2.7)$$
$$b_3 = \sum_{\alpha\in J} E\big|E\big(X_\alpha - p_\alpha \mid X_\beta : \beta \notin B_\alpha\big)\big|. \qquad (2.8)$$

Theorem 2.8. The total variation distance between the distribution of $W$, $\mathcal{L}(W)$, and the Poisson distribution with mean $\lambda$, $\mathrm{Po}(\lambda)$, satisfies
$$\|\mathcal{L}(W) - \mathrm{Po}(\lambda)\| \le (b_1 + b_2)\,\frac{1 - e^{-\lambda}}{\lambda} + b_3\min\big(1,\, 1.4\,\lambda^{-1/2}\big). \qquad (2.9)$$
With Theorem 2.8, we now introduce the second proof of Theorem 2.7.
Proof. (Proof II) Let the finite index set be $J = \{1, \ldots, d\}$, and let the indicator of the event $\{Z_i > c\}$ be $X_i = I(Z_i > c)$. Suppose $p_i = P(X_i = 1) = 1 - \Phi(c)$ and $W = \sum_{i=1}^d X_i$, so that the event $\{\max_i Z_i > c\}$ becomes $\{W \ge 1\}$ and $\lambda = \sum_i p_i = d\big(1 - \Phi(c)\big)$.

Next, we shall apply Theorem 2.8. Take $B_i$, the neighborhood of $X_i$, to be the whole index set $J$. Then $b_3$, given by equation (2.8), becomes 0, while $b_1 = \lambda^2$ and $b_2 = \sum_{i\ne j} P(Z_i > c, Z_j > c)$.

Therefore, equation (2.9) becomes
$$\big|P(W \ge 1) - (1 - e^{-\lambda})\big| \le \frac{1 - e^{-\lambda}}{\lambda}\Big(\lambda^2 + \sum_{i\ne j} P(Z_i > c, Z_j > c)\Big).$$
By Lemma 2.3 and Proposition 2.6, $P(Z_i > c, Z_j > c) = o\big(1 - \Phi(c)\big)$ for all $i \ne j$, so the right-hand side is $o(\lambda)$ as $c \to \infty$. Since $\lambda \to 0$, we obtain
$$P\Big(\max_{1\le i\le d} Z_i > c\Big) = P(W \ge 1) = 1 - e^{-\lambda} + o(\lambda) \sim \lambda = d\big(1 - \Phi(c)\big).$$
Similarly to the arguments for $\max Z_i$, we define the indicators $Y_i = I(Z_i \le c_0)$, $q_i = P(Y_i = 1) = \Phi(c_0)$ and $U = \sum_{i=1}^d Y_i$. Then the event $\{\min_i Z_i \le c_0\}$ becomes $\{U \ge 1\}$. Since $\sum_i q_i$ tends to zero as $c_0 \to -\infty$, the same argument yields that
$$P\Big(\min_{1\le i\le d} Z_i \le c_0\Big) = P(U \ge 1) \sim \sum_i q_i = d\,\Phi(c_0). \qquad \square$$
Remark 2.9. As one can see, the proof using Bonferroni's inequalities is more approachable and easier to understand. However, we keep the second proof in this section because it is an interesting application of Poisson approximation associated with the Chen-Stein method.
Chapter 3

Asymptotic Results of Words in DNA

3.1 Tail Probabilities of Extrema of Sums of m-dependent Variables

In this section, we study the behavior of the tail probabilities of sums of $m$-dependent random variables when $x$ is allowed to grow with $n$ as $n \to \infty$ (so-called moderate deviations).
Theorem 3.1. Let $X_1, X_2, \ldots$ be a sequence of $m$-dependent random variables with $EX_k = 0$ and $E|X_k|^p < \infty$, $p = 2 + c_0^2$, for some $c_0 > 0$, and let $q = \min(p, 3)$. Take $B_n^2 = ES_n^2$ with $S_n = X_1 + \cdots + X_n$, and write $F_n(x) = P(S_n/B_n \le x)$ and $L_{n,q} = B_n^{-q}\sum_{k=1}^n E|X_k|^q$. Then in the interval $1 \le x \le c\sqrt{\ln B_n^2}$, $0 < c \le c_0$, we have
$$\frac{1 - F_n(x)}{1 - \Phi(x)} = \frac{F_n(-x)}{\Phi(-x)} = 1 + O\big(m^{q-1}\,x^{2p-q}\,L_{n,q}\big).$$
Under additional assumptions, the ratio $\big(1 - F_n(x)\big)/\big(1 - \Phi(x)\big)$ can be shown to be asymptotically equal to 1, as stated in the following theorem.
Theorem 3.2. Let $X_1, X_2, \ldots$ be a sequence of $m$-dependent random variables with $EX_k = 0$ and $E|X_k|^{2p} \le C_p < \infty$, $p = 2 + c_0^2$, for some $c_0 > 0$. Then, for $B_n^2 \asymp n$ and $1 \le x \le c\sqrt{\ln n}$ with $0 < c \le c_0$,
$$\frac{1 - F_n(x)}{1 - \Phi(x)} \to 1 \quad \text{and} \quad \frac{F_n(-x)}{\Phi(-x)} \to 1 \quad \text{as } n \to \infty.$$
Finally, we consider $R_2$. From the above discussion, we only need to examine the convergence of $m^{q-1}x^{2p-q}L_{n,q}$. When $q = p$, $x^{2p-q}L_{n,q} = x^p L_{n,p} \to 0$, as has been proved earlier. When $q = 3$, using a similar argument, we have that $m^{2}x^{2p-3}L_{n,3} \to 0$ as well.
In the above theorems, moderate deviations for $m$-dependent random variables are studied. We know that, when $B_n^2 \asymp n$ is satisfied, the tail probability of $S_n/B_n$ is approximated by the tail probability of a standard normal distribution as $x$ and $n$ tend to infinity with $x \le c\sqrt{\ln n}$.
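To see this in action, the sketch below constructs a simple 1-dependent sequence $X_k = Y_k + Y_{k+1}$ (our construction, not from the thesis) and compares the empirical tail of $S_n/B_n$ at a moderate level $x \approx \sqrt{\ln n}$ with the standard normal tail:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, reps = 200, 50_000
Y = rng.uniform(-1.0, 1.0, size=(reps, n + 1))
X = Y[:, :-1] + Y[:, 1:]       # X_k = Y_k + Y_{k+1}: 1-dependent with mean 0
S = X.sum(axis=1)
B = S.std()                    # empirical stand-in for B_n = sqrt(E S_n^2)
x = np.sqrt(np.log(n))         # moderate-deviation level x ~ sqrt(ln n)
print((S / B > x).mean(), norm.sf(x))   # the two tails should be comparable
```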
To obtain the asymptotic results on over- and under-representation of DNA word counts, we shall study the distribution functions of word counts in DNA sequences and the normality of these word counts; Theorem 3.2 will be applied. To begin with, however, it is crucial to investigate the properties of the means and the covariance matrix of a set of word counts. From these, we will easily obtain a set of centered and standardized scores of the word counts.
Now, consider a DNA sequence $A = A_1A_2\cdots A_n$ with independent and identically distributed letters from the state space $\mathcal{A} = \{A, C, G, T\}$. The probabilities of picking letters from $\mathcal{A}$ are $(p_A, p_C, p_G, p_T)$. Let $u = u_1u_2\cdots u_k$ and $v = v_1v_2\cdots v_k$ be two $k$-tuple words with letters $u_i, v_i \in \mathcal{A}$. The indicator that the word $u$ occurs at the starting position $i$ in the sequence $A_1\cdots A_n$ is denoted by
$$I_u(i) = I(A_iA_{i+1}\cdots A_{i+k-1} = u).$$
It is easy to derive that
$$E\big(I_u(i)\big) = p_{u_1}p_{u_2}\cdots p_{u_k}.$$
Let $\pi_u = p_{u_1}p_{u_2}\cdots p_{u_k}$. It is obvious that $\pi_u$ is independent of the exact positions of the letters in the word $u$.
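A one-line check of this observation (assumed probability values, names ours):

```python
from math import prod

def pi(word: str, p: dict) -> float:
    """pi_u = p_{u_1} p_{u_2} ... p_{u_k}."""
    return prod(p[a] for a in word)

p = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}   # assumed letter probabilities
print(pi("ACG", p) == pi("GCA", p))            # True: only the letters matter
```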
In the following two theorems, we shall not only prove the existence of the limiting means and covariances of word counts, but also give their exact expressions. With these expressions, our computation of the score functions will be facilitated.
Theorem 3.3. For an i.i.d. DNA sequence $A_1A_2\cdots A_n$ and a word $u$, let $N_u(n)$ be its word count. Then
$$EN_u(n) = (n - k + 1)\,\pi_u.$$
To calculate the covariance of $N_u$ and $N_v$, we must take into account the dependence between $u$ and $v$ when they overlap. Define an overlap bit $\beta_{u,v}(j) = I(u_{j+1} = v_1, \ldots, u_k = v_{k-j})$. When $0 \le j - i < k$, we have
$$E\big(I_u(i)I_v(j)\big) = p_{u_1}\cdots p_{u_{j-i}}\,p_{u_{j-i+1}}\cdots p_{u_k}\,\beta_{u,v}(j-i)\,p_{v_{k-j+i+1}}\cdots p_{v_k} = \pi_u\,\beta_{u,v}(j-i)\,p_{v_{k-j+i+1}}\cdots p_{v_k}.$$
When $j - i \ge k$, $u$ and $v$ do not overlap. Then, by independence,
$$E\big(I_u(i)I_v(j)\big) = \pi_u \pi_v.$$
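A small Python sketch of the overlap bit and the resulting product expectation under M0 (function names and probability values are ours):

```python
def overlap_bit(u: str, v: str, j: int) -> int:
    """beta_{u,v}(j) = 1 iff the last k-j letters of u equal the first k-j of v."""
    k = len(u)
    return int(u[j:] == v[:k - j])

def product_expectation(u: str, v: str, j: int, p: dict) -> float:
    """E[I_u(i) I_v(i+j)] for 0 <= j < k under M0, following the formula above:
    pi_u * beta_{u,v}(j) * p_{v_{k-j+1}} ... p_{v_k}."""
    k = len(u)
    pi_u = 1.0
    for a in u:
        pi_u *= p[a]
    tail = 1.0
    for a in v[k - j:]:
        tail *= p[a]
    return pi_u * overlap_bit(u, v, j) * tail

p = {a: 0.25 for a in "ACGT"}                   # assumed uniform letter probabilities
print(product_expectation("ACG", "CGT", 1, p))  # 0.25**4 = P(A_i..A_{i+3} = "ACGT")
```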