Lim Soon Kit
An academic exercise presented in partial fulfillment for the degree
of Master of Science in Mathematics.
Supervisor : Dr Zhang Louxin
Department of Mathematics
National University of Singapore
2003/2004
Acknowledgements
1.1 Bioinformatics
1.2 Non-parametric Statistics
1.3 Psychology, Ecology and Meteorology
1.4 Statistical quality control and charts
1.5 Reliability
1.6 Radar astronomy
1.7 Sociology
1.8 Formulation of the run statistics problems
2 Success runs
2.1 Probability of occurrence
2.2 Mean waiting time for first occurrence
4.3 Bounds on β and λ
I would like to take this opportunity to express my gratitude to my supervisor, Dr Zhang Louxin, for his help and guidance throughout this academic exercise. His advice on different aspects of life, not just academic ones, was much appreciated.
A run or pattern is a specified sequence of outcomes that may occur at some point in the series of trials. Runs containing a single symbol are called success runs. Runs containing 2 symbols are called success-failure runs. Runs containing more than 2 symbols are called multiple runs.
In Chapter 1, we define various statistics based on runs and patterns and look at how the analysis of runs and patterns is being used in practical applications.
In Chapter 2, we will derive the probability of occurrence of a success run of length $l$ before or at position $n$, denoted by $Q_n$, through a recurrence relation in its complement $\bar{Q}_n$. The probability of the first occurrence of a success run of length $l$ at position $n$, denoted by $f_n$, is also obtained. We will also obtain the mean waiting time for the first occurrence of a success run and the mean and variance of the number of occurrences.
In Chapter 3, we look at success-failure runs and discuss how to obtain the mean and variance of the number of occurrences of such runs. We also define and obtain expressions for the ‘distance’ between occurrences of success-failure runs.
In Chapter 4, we study spaced patterns, where a spaced pattern is denoted by a specified sequence on the set {1, *}; the 1s correspond to the matching positions and the *s correspond to the ‘don’t care’ positions. We will derive the probability of occurrence of a spaced pattern through a set of recurrence relations. We will also look at some asymptotic results for approximating the probability of occurrence of spaced patterns.
Introduction and background
Consider a series of $n$ trials, $x_1, x_2, \ldots, x_n$, where each trial has at least two possible outcomes and $x_i \in A = \{0, 1, \ldots, m\}$ for all $i$, with $m \ge 1$. A run or pattern is defined to be a specified sequence of outcomes that may occur at some point in the series of trials. For example, let $A = \{1, 2, 3\}$ and consider the runs $R_1 = 1111$, $R_2 = 1122$ and $R_3 = 1231$. Runs containing a single symbol, for example $R_1$, are called success runs. Runs containing 2 symbols, for example $R_2$, are called success-failure runs. Runs containing more than 2 symbols, for example $R_3$, are called multiple runs. We can also consider runs specified on the set $A \cup \{*\}$, where the *s correspond to ‘don’t care’ positions. We call such runs ‘spaced patterns’. For example, in the spaced pattern $R_4 = 12{*}1$, the ‘*’ can be any symbol from the alphabet set $A$.
The probabilistic analysis of runs and patterns plays an important role in many statistical areas such as reliability of engineering systems, DNA sequencing, nonparametric statistics, psychology, ecology and radar astronomy. In the next few sections, we will look at how the analysis of runs and patterns is being used in practical applications.
1.1 Bioinformatics
The sequence similarity of a newly discovered gene to known genes is often an important clue to its function and structure. By comparing DNA or protein sequences, one can learn about the functionality or the structure of proteins without performing any experiments. Therefore, whenever a biologist sequences a gene, the next thing to do is to search the sequence or protein databases for similar sequences. To give a measure of the similarity between two DNA or protein sequences, we need a definition: an alignment of two sequences $x$ and $y$ is a new pair of sequences $x'$ and $y'$ of equal length such that
1. $x'$ and $y'$ are obtained from $x$ and $y$ respectively by inserting occurrences of the space symbol ‘-’;
2. no two space symbols lie in the same position in $x'$ and $y'$.
For example, if x = abcbdc and y = accdbdb, then one alignment of x and y
is as follows:
Furthermore, the above alignment contains 3 matches, 2 mismatches and 3 insertions of ‘-’.
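The two conditions in the definition of an alignment, and the column counts quoted above, can be checked mechanically. The sketch below is illustrative only: the function name `check_alignment` and the particular alignment `abcbdc--` / `a-ccdbdb` are assumptions chosen here for demonstration. It is one alignment of x = abcbdc and y = accdbdb with 3 matches, 2 mismatches and 3 insertions, not necessarily the one intended in the text.

```python
def check_alignment(x, y, xa, ya, gap="-"):
    """Validate an alignment (xa, ya) of x and y; return (matches, mismatches, gaps)."""
    if len(xa) != len(ya):
        raise ValueError("aligned sequences must have equal length")
    # Condition 1: xa and ya come from x and y by inserting gap symbols only.
    if xa.replace(gap, "") != x or ya.replace(gap, "") != y:
        raise ValueError("removing gaps must give back the original sequences")
    matches = mismatches = gaps = 0
    for a, b in zip(xa, ya):
        # Condition 2: no column may contain two gap symbols.
        if a == gap and b == gap:
            raise ValueError("two space symbols in the same column")
        if a == gap or b == gap:
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


# Hypothetical alignment used purely for illustration.
print(check_alignment("abcbdc", "accdbdb", "abcbdc--", "a-ccdbdb"))  # (3, 2, 3)
```

Summing a column score over the same loop would give the alignment score discussed next; the scoring function itself is left out because its values are not fixed by the text.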
The quality of an alignment, that is, the degree to which it displays the similarity between the two sequences, is measured by a score. Such a score is given by the sum of scores associated with the individual columns of the alignment. The score of a column is given by a symmetric scoring function $f$ that maps pairs of symbols from the alphabet $A \cup \{-\}$ to real numbers, where $A$ is {A, C, T, G} (for DNA sequences) or the set of amino acids. Generally, we will have $f(a, a) > 0$ for all $a \in A$ and $f(a, b) < 0$ for $a \ne b$, so that matched columns increase the score of the alignment whereas mismatches and insertions of ‘-’ are penalized. Such a score scheme can be given in the form of a table, such as
database for high-scoring local alignments. However, this would require a huge amount of computational time. Instead, a heuristic approach is used for this problem. The BLAST (Basic Local Alignment Search Tool) programme first finds reasonably long exact matches ($k$ consecutive bases) between the given sequence and a sequence in the database, and then extends these exact matches into local alignments. The principle underlying this is that, based on statistical study, two sequences are likely to have a high-scoring local alignment only if there are reasonably long exact matches between them. The larger the value of $k$, the faster the programme but the poorer the sensitivity. The value of $k$ is usually chosen to be 11 as a tradeoff between search speed and sensitivity. Another programme used for database search is PatternHunter [13]. Unlike BLAST, which looks for $k$ consecutive matches, PatternHunter looks for $k$ nonconsecutive matches given by a matching pattern.
More precisely, a specific set of matching positions is a spaced pattern of *s and 1s, where the 1s give the matching positions. For example, if we use the spaced pattern 1*1*1, then we are only looking for matches in the first, third and fifth positions. The two sequences ATCGACC and AGCTACC contain 2 matches of the spaced pattern 1*1*1, ending at the fifth and seventh positions given below:
The sensitivity of a spaced pattern $S$ is defined as the probability of the occurrence of $S$ in a random 0-1 sequence of fixed length $N = 64$. Both the pattern and the number of 1s in $S$ determine its sensitivity. For example, the spaced pattern 111*1**11**1*1*111 has a probability of occurrence of 0.712 for a 64-bit random string, while the pattern 11111111111 used by BLAST has only a probability of 0.3. Theoretical analysis of the sensitivity of spaced patterns is important both in theory and in practical applications.
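The notion of a spaced-pattern match between two sequences can be made concrete with a short sketch. The function name `spaced_hits` is hypothetical. It reports the ending positions (1-based) at which every 1-position of the pattern sees identical symbols in the two sequences, and it is applied to the pair ATCGACC and AGCTACC from the example above.

```python
def spaced_hits(s, t, pattern):
    """1-based ending positions where s and t agree at every '1' of the pattern."""
    n, w = min(len(s), len(t)), len(pattern)
    hits = []
    for start in range(n - w + 1):
        # '*' positions are 'don't care'; only the '1' positions must match.
        if all(s[start + i] == t[start + i]
               for i, c in enumerate(pattern) if c == "1"):
            hits.append(start + w)
    return hits


print(spaced_hits("ATCGACC", "AGCTACC", "1*1*1"))  # [5, 7]
print(spaced_hits("ATCGACC", "AGCTACC", "111"))    # [7]
```

A consecutive pattern such as 111 is simply the special case with no *s, which is why the same routine covers both the BLAST-style and the PatternHunter-style seeds.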
1.2 Non-parametric Statistics
A nonparametric test is a test of a hypothesis that is not a statement about parameter values. The type of statement permissible then depends on the definition accepted for the term parameter. The hypothesis for a nonparametric test can only be related to the form of the population, as in goodness-of-fit tests, or to some characteristic of the probability distribution of the sample data, as in tests of trend and randomness, and for identical sampled populations.
Suppose that on some day during lunch time, a queue of fifteen persons waiting in line to get into a certain restaurant is observed. There are eight males (M) and seven females (F) forming a line in the arrangement:
M, F, M, F, M, F, M, F, M, F, M, F, M, F, M
Would this be considered a random arrangement by sex? Intuitively, the answer is no, since the males and females seem to alternate, suggesting intentional mixing by pairs. This arrangement is an extreme case, just like M, M, M, M, M, M, M, M, F, F, F, F, F, F, F with intentional clustering. In less extreme situations, the randomness of an arrangement can be tested statistically using the theory of runs.
Given an ordered sequence of two or more types of symbols, a run is defined to be a succession of one or more identical symbols which are followed and preceded by a different symbol or no symbol at all. Hence in the analysis of strings and patterns, a run is simply a consecutive pattern of length at least one. Clues to lack of randomness are provided by the tendency of symbols to exhibit a definite pattern in the sequence. If there is sequential dependency among symbols of the same type, then they may tend to cluster, and this may be indicated by an unusually small number of runs, or clusters of same symbols, or by runs of unexpected length. Thus the total number of runs and the lengths of the runs should reflect the existence of some sort of pattern. Hence these two criteria can be used to test for randomness. Too few runs, too many runs, a run of excessive length, or too many runs of excessive length can all be used as statistical criteria for the rejection of the null hypothesis of randomness, as these situations should occur rarely if the sequence is truly random.
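The two criteria named above, the total number of runs and the length of the longest run, are easy to compute for the queue example. The following is a minimal sketch; the helper names are hypothetical, and the formula for the expected number of runs, $1 + 2n_1n_2/(n_1+n_2)$, is the standard Wald-Wolfowitz result for a random arrangement of $n_1$ and $n_2$ symbols, which is brought in here as an outside fact rather than taken from the text.

```python
from itertools import groupby


def run_lengths(seq):
    """Split seq into maximal runs, returned as (symbol, length) pairs."""
    return [(sym, len(list(group))) for sym, group in groupby(seq)]


def expected_runs(n1, n2):
    """Expected total number of runs in a random arrangement (Wald-Wolfowitz)."""
    return 1 + 2 * n1 * n2 / (n1 + n2)


alternating = "MFMFMFMFMFMFMFM"  # the extreme alternating queue
clustered = "MMMMMMMMFFFFFFF"    # the extreme clustered queue
for seq in (alternating, clustered):
    runs = run_lengths(seq)
    print(len(runs), max(length for _, length in runs))
# 15 1
# 2 8
print(round(expected_runs(8, 7), 2))  # 8.47
```

Both observed run counts, 15 and 2, lie far from the value of about 8.47 expected under randomness, which matches the intuition that neither queue looks random.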
1.3 Psychology, Ecology and Meteorology
The theory that success breeds success, that is, that attaining a positive outcome makes it more probable that a positive outcome will be attained on the next trial, is often considered in psychological achievement testing, animal learning studies, athletic competition and similar matters. It is also possible that failure breeds failure, or that both phenomena may be present.
In animal learning experiments, a test animal performs a sequence of trials, in each of which it either succeeds or fails at selecting a box containing food, at running a maze or at some other task. By observing the length of the longest run, a psychologist can test for improvement in the animal’s performance.
In athletic competition, a psychologist may make the hypothesis that winning a trophy one year increases an athletic team’s motivation to win it the next year. The psychologist obtains a sample of the list of wins (W) and losses (L) of an annual trophy for a certain team, and writes it in the form of a sequence, for example:
WWWLLWWWWLLLLWWLLLWWWWWW
Then a nonparametric test based on the length of the longest run can be used to accept or reject the hypothesis.
In ecological studies on the distribution of some characteristics such as species type or the prevalence of a specific disease, it is quite common to take a line or belt transect and then observe the characteristics of the sampled trees. The length of the longest run of trees having that specific characteristic is then used for drawing some useful conclusions on the segregation of the species or the spread of the disease.
Similar situations arise in meteorology when one is interested in checking whether there is a tendency towards the persistence of the same type of weather.
1.4 Statistical quality control and charts
Sampling inspection plans commonly used are those based on the familiar curtailed inverse sampling scheme, in which inspection stops when either a certain number of non-conforming items is reached or a prespecified number of items is sampled, whichever comes first. The lot is rejected in the former case and accepted in the latter. Such sampling plans are desirable as they are economical in terms of time and cost. The idea of using a run as a stopping criterion in acceptance sampling was subsequently introduced, and since then many run-based acceptance sampling plans have been proposed and investigated.
In the application of statistical methods to quality control charts, a usual procedure is to construct a control chart with control limits spaced about the mean such that, under conditions of statistical control, or random sampling, the probability of an observation falling outside these limits is a given $\alpha$, for example, $\alpha = 0.05$.
The occurrence of a point outside these limits is taken as an indication of the presence of assignable causes of variation in the production line. Such a form of chart has been found to be of particular value in the detection of the presence of assignable causes of variability in the quality of manufactured product. As recently pointed out, however, the statistician may not only help to detect the presence of assignable causes, but also help to discover the causes themselves in the course of further research and development.
For this purpose, runs of different kinds and of different lengths have been found useful by industrial statisticians. Quality control engineers have found that a convenient indication of lack of control is the occurrence of long runs of observations whose values lie above or below the median of the sample.
1.5 Reliability
A consecutive k-out-of-n failure system is a system of components numbered 1 through $n$ which fails if and only if at least $k$ consecutively numbered components fail. A more formal and precise definition of this reliability system is as follows. Consider a system consisting of $n$ components placed in a line and labeled as first, second, and so on, up to $n$-th. Each component can only be in one of two states, either operating (up/good/working) or failed (down/bad/not working), and so also the entire system. The $n$ components are assumed to work independently of each other. Then the whole system fails whenever at least $k$ consecutive components are in their failed state; that is, the system functions if and only if there exists no succession of $k$ failed components. Such systems arise in many different settings including telecommunications and computer networks. In order to make the idea of this system clear, we now provide two examples:
1. A system of $n$ radar stations is used for transmitting information from Site A to Site B. Suppose that the stations are equally spaced between the two sites and that each station is able to transmit signals up to a distance of $k$ microwave stations. Then the system clearly becomes non-functional (unable to transmit information from Site A to Site B) if and only if at least $k$ consecutive radar stations are out of order;
2. A fluid transportation network uses pumps and pipes to carry the fluid from Point A to Point B. Suppose that the pump stations are equally spaced between the two points and that each pump station is able to transport fluid up to a distance of $k$ pump stations. If one pump is down, the transportation of fluid is not hindered, as the previous station can overcome this difficulty. However, when $k$ or more consecutive pumps are down, the transportation of fluid stops.
Many different aspects and characteristics of such systems have been studied under various assumptions regarding the random variables describing the performance of the components. Early work dealt with independent and identically distributed components and considered procedures for finding the reliability of such systems. When analyzing the reliability of a consecutive k-out-of-n failure system, the reliability can be viewed as the probability that the random number of runs of at least $k$ consecutive failures in $n$ independent Bernoulli trials is zero. Alternatively, the reliability can be viewed as the probability that the waiting time until the first run of $k$ consecutive failures is greater than $n$.
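To make the first viewpoint concrete, the reliability of a consecutive k-out-of-n failure system with independent, identically distributed components can be computed by tracking the length of the current trailing run of failures. This is a sketch under that i.i.d. assumption only; the function names and the parameter `p_fail` (the failure probability of a single component) are hypothetical, and the brute-force enumeration is included purely as a cross-check for small $n$.

```python
from itertools import product


def reliability(n, k, p_fail):
    """P(no k consecutive failures among n independent components)."""
    # state[r] = P(system still working and the trailing failure run has length r), r < k
    state = [1.0] + [0.0] * (k - 1)
    for _ in range(n):
        new = [0.0] * k
        new[0] = sum(state) * (1 - p_fail)  # a working component resets the run
        for r in range(k - 1):
            new[r + 1] = state[r] * p_fail  # a failed component extends the run
        state = new                         # runs reaching length k are dropped
    return sum(state)


def reliability_brute(n, k, p_fail):
    """Same probability by enumerating all 2**n component states ('1' = failed)."""
    total = 0.0
    for outcome in product("01", repeat=n):
        s = "".join(outcome)
        if "1" * k not in s:
            f = s.count("1")
            total += p_fail ** f * (1 - p_fail) ** (n - f)
    return total


print(reliability(3, 2, 0.5))  # 0.625
```

The dynamic programme runs in time proportional to $nk$, whereas the enumeration grows as $2^n$ and is only practical as a test.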
1.6 Radar astronomy
In radar astronomical observations of minor planets or asteroids, data consist of echo-power spectral-density estimates at a sequence of Doppler frequencies. An empirically determined background filter shape can be removed from the raw spectrum, and the resultant background-free spectrum normalized to the root-mean-square fluctuation in the receiver noise. If no echo is present, the model that the spectral estimates behave as a sequence of independent and identically distributed Normal(0,1) random variables agrees with both a priori theoretical considerations and a posteriori experimental evidence. However, if the target has a sufficiently large radar cross-section, a radar echo would be expected to produce a sequence of above-average readings in some portion of the frequency band. A test for the presence of an echo can be based on the length of the longest run of positive readings.
1.7 Sociology
The behaviour of groups of people forming lines and other structures can be modelled as a Markov-dependent sequence of trials in which some characteristic such as the sex of the individual is taken as the trial outcome. It is of interest to find out whether the occurrence of certain runs is plausible under various types and orders of Markov dependence.
For example, for a group of primary school children queueing up before a canteen stall, a sociologist may wish to test the hypothesis that groupings of small children are random with regard to sex against the alternative hypothesis that small children of the same sex tend to congregate. The total number of runs in the sequence can be used as a nonparametric test for randomness.
1.8 Formulation of the run statistics problems
Recall that a run or pattern is defined to be a specified sequence of outcomes that may occur at some point in the series of trials.
Let $x_1, x_2, \ldots, x_n$ be a sequence of $n$ trials, where $x_i \in A = \{0, 1, \ldots, m\}$ for all $i$ and $m \ge 1$. Since we are mainly interested in success runs, success-failure runs and spaced patterns, we will let $m$ be 1 throughout this thesis, that is, we will take $A$ to be the set of binary bits $\{0, 1\}$. Generally, the probabilities of the outcomes can vary arbitrarily from trial to trial, and can be $L$-order Markov dependent on the $L$ preceding outcomes. In this thesis, we will assume that the outcome probabilities at any trial are independent of the outcomes of all previous trials.
It is necessary to address the question of how to count the number of times a given run or pattern occurs in a sequence of trials. For example, we want to count the number of occurrences of the patterns 1010 and 1111, and the spaced pattern 1*1*1, in the sequence
1111010101111010.
If overlapping occurrences of 1010 are all relevant and counted, then the pattern 1010 is counted as occurring in positions 4-7, 6-9 and 13-16, thus occurring a total of 3 times. Even though two of these occurrences overlap in positions 6 and 7, both are counted. If second and higher overlapping patterns are not counted, the pattern 1010 is counted as occurring only twice, namely in positions 4-7 and 13-16. The pattern 1111 occurs twice, in positions 1-4 and 10-13, without ambiguity. For the spaced pattern 1*1*1, we are only interested in the bit 1 occurring at the first, third and fifth positions. That is, any occurrence of 10101, 11101, 10111 or 11111 in the sequence contributes one count of the spaced pattern 1*1*1 if overlaps are allowed. Therefore the spaced pattern 1*1*1 occurs a total of five times in the sequence, at positions 2-6, 4-8, 6-10, 8-12 and 11-15, when overlapping occurrences are all counted. Hence it is clear that the number of times that a given pattern occurs depends on whether overlaps are allowed. For different purposes different definitions have been adopted. For example, in [8], Feller considers only non-overlapping counting for a recurrent event. In [9], both overlapping and non-overlapping countings are considered. It is also largely a matter of convention and convenience whether we consider the starting or ending position as the position of occurrence. In the previous example, the pattern 1111 occurs in the sequence from positions 1 to 4. We say that the pattern 1111 occurs at position 1 if we consider the starting position as the position of occurrence. On the other hand, we can say that the pattern 1111 occurs at position 4 if we consider the ending position instead as the position of occurrence. Throughout, we will take the ending position as the position of occurrence.
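The counting conventions above can be reproduced with a small sketch. The function name `occurrences` and its `overlapping` flag are hypothetical. Occurrences are reported by ending position, following the convention adopted in the text, and non-overlapping counting is implemented in the left-to-right sense (after an occurrence is counted, the scan restarts immediately after it).

```python
def occurrences(seq, pattern, overlapping=True):
    """1-based ending positions of pattern in seq; '*' matches any symbol."""
    w = len(pattern)
    ends, start = [], 0
    while start + w <= len(seq):
        window = seq[start:start + w]
        if all(c == "*" or c == x for c, x in zip(pattern, window)):
            ends.append(start + w)
            # Non-overlapping counting restarts right after the counted occurrence.
            start += 1 if overlapping else w
        else:
            start += 1
    return ends


seq = "1111010101111010"
print(occurrences(seq, "1010"))                     # [7, 9, 16]
print(occurrences(seq, "1010", overlapping=False))  # [7, 16]
print(occurrences(seq, "1111"))                     # [4, 13]
print(occurrences(seq, "1*1*1"))                    # [6, 8, 10, 12, 15]
```

The ending positions 7, 9 and 16 correspond to the occurrences at positions 4-7, 6-9 and 13-16, and the five hits for 1*1*1 correspond to positions 2-6, 4-8, 6-10, 8-12 and 11-15.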
Due to the many areas of applications of runs and patterns, different statistics of runs and patterns have been defined for various purposes. Occurrence of runs, distance between occurrences, length of longest success runs and other problems have been studied in various papers. In [8], the author obtains the generating function of the waiting time for a run of $l$ successes and derives the related mean waiting time. The probability that a success run occurs before a failure run is also discussed. In [9], probability distributions of the number of success runs of size exactly $k$ and the number of success runs of size greater than or equal to $k$, among others, are derived using the technique of Markov chain imbedding. In [7], the ‘distance’ between occurrences and the mean and variance of the number of occurrences are obtained for the pattern 1010. The distance between two occurrences of a pattern is the difference between the two positions of occurrence. For example, in the sequence 1101001101, the pattern 11 occurs twice and the distance between the two occurrences is 4. In [10], the length of the longest success run and the total number of runs are mainly of interest in randomness testing.
In this thesis we will derive the probability of occurrence of a success run of length $l$ before or at position $n$, denoted by $Q_n$, through a recurrence relation in its complement $\bar{Q}_n$. The probability of the first occurrence of a success run of length $l$ at position $n$, denoted by $f_n$, is also obtained. We will also obtain the mean waiting time for the first occurrence of a success run and the mean and variance of the number of occurrences.
Next, we will look at success-failure runs and discuss how to obtain the mean and variance of the number of occurrences of such runs. We also define and obtain expressions for the ‘distance’ between occurrences of success-failure runs.
Finally, we study spaced patterns, where a spaced pattern is denoted by a specified sequence on the set {1, *}; the 1s correspond to the matching positions and the *s correspond to the ‘don’t care’ positions. We will derive the probability of occurrence of a spaced pattern through a set of recurrence relations. We will also look at some asymptotic results for approximating the probability of occurrence of spaced patterns.
Success runs
In this chapter, we will derive the probability of occurrence of a success run of length $l$ before or at position $n$, denoted by $Q_n$, through a recurrence relation in its complement $\bar{Q}_n$. The probability of the first occurrence of a success run of length $l$ at position $n$, denoted by $f_n$, is also obtained.
Let $A = \{0, 1\}$. Recall that a success run is a run $R$ consisting of a single symbol on $A$. We will consider a success run $R$ to be a pattern of consecutive 1s, that is, $R = 11\cdots1 = 1^l$, where $l$ is the length of $R$. At any trial, the probability of the bit 1 occurring is denoted by $p$, and the probability of the bit 0 occurring is denoted by $q = 1 - p$. We assume that the outcome probabilities at any trial are independent of the outcomes of all previous trials. Recall that, if $R = 1^l$ occurs at positions $k$ to $k + l - 1$, we say that $R$ occurs at position $k + l - 1$, the ending position.
\[
\begin{aligned}
f_n &= P(R \text{ first occurs at position } n) \\
    &= P(R \text{ occurs before or at position } n) - P(R \text{ occurs before or at position } n - 1)
\end{aligned}
\]
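Proof. From Theorem 2.1 we get that
Proposition 2.1.3. For any $n \ge l$, $\bar{Q}_n \ge \bar{Q}_l \bar{Q}_{n-l}$.
Proof. The proof is by induction on $n$. When $n = l$, equality holds and hence the statement is true. Now,
\[
\begin{aligned}
\bar{Q}_{n+1} - \bar{Q}_l \bar{Q}_{n+1-l}
&= q\bar{Q}_n + qp\bar{Q}_{n-1} + \cdots + qp^{l-1}\bar{Q}_{n-l+1} \\
&\quad - \bar{Q}_l \left( q\bar{Q}_{n-l} + qp\bar{Q}_{n-l-1} + \cdots + qp^{l-1}\bar{Q}_{n+1-2l} \right) \\
&= q\left(\bar{Q}_n - \bar{Q}_l \bar{Q}_{n-l}\right) + qp\left(\bar{Q}_{n-1} - \bar{Q}_l \bar{Q}_{n-l-1}\right) + \cdots \\
&\quad + qp^{l-1}\left(\bar{Q}_{n-l+1} - \bar{Q}_l \bar{Q}_{n+1-2l}\right)
\end{aligned}
\]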
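Proof. We will prove (1) by induction on $k$. For the base case $k = l$, the inequality holds by Proposition 2.1.3. Hence
\[
\begin{aligned}
\bar{Q}_n - \bar{Q}_{k+1}\bar{Q}_{n-(k+1)+l-1}
&= \bar{Q}_n - \bar{Q}_{k+1}\bar{Q}_{n-k+l-2} \\
&= q\bar{Q}_{n-1} + qp\bar{Q}_{n-2} + \cdots + qp^{l-1}\bar{Q}_{n-l} \\
&\quad - \bar{Q}_{n-k+l-2}\left( q\bar{Q}_k + qp\bar{Q}_{k-1} + \cdots + qp^{l-1}\bar{Q}_{k+1-l} \right) \\
&= q\left(\bar{Q}_{n-1} - \bar{Q}_k \bar{Q}_{n-k+l-2}\right) + qp\left(\bar{Q}_{n-2} - \bar{Q}_{k-1}\bar{Q}_{n-k+l-2}\right) + \cdots \\
&\quad + qp^{l-1}\left(\bar{Q}_{n-l} - \bar{Q}_{k+1-l}\bar{Q}_{n-k+l-2}\right) \\
&\ge 0
\end{aligned}
\]
by strong induction.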
To prove (2), we let $A_i$ denote the event that $R$ occurs at position $i$, and $\bar{A}_i$ the complement of $A_i$. Note that $A_i = \emptyset$ for $i = 1, 2, \ldots, l - 1$. Hence we have
Proof. Recall that
\[
f_n = P(R \text{ first occurs at position } n) = qp^l \bar{Q}_{n-l-1}.
\]
Using the first inequality of Theorem 2.3 with $n$ and $k$ replaced by $n - l - 1$ and $k - l - 1$ respectively, we have
\[
\bar{Q}_{n-l-1} \le \bar{Q}_{k-l-1}\bar{Q}_{(n-l-1)-(k-l-1)+l-1} = \bar{Q}_{k-l-1}\bar{Q}_{n-k+l-1}.
\]
Multiplying both sides of the inequality by $qp^l$, we get that
2.2 Mean waiting time for first occurrence
Let $R = 1^l$ be a success run of length $l$. The mean waiting time for the first occurrence of $R$, or ‘recurrence times’ as mentioned in [8], is derived using generating functions by Feller. In this section, we will use a different method to derive the mean waiting time. Let $v$ denote the position where $R$ first occurs. We will obtain $E(v)$, the mean of $v$, in terms of $l$ and $p$, the probability of the bit 1 occurring.
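As a numerical sanity check, $E(v) = \sum_n n f_n$ can be approximated directly from the relation $f_n = qp^l\bar{Q}_{n-l-1}$ recalled earlier and compared with Feller's closed form $(1 - p^l)/((1 - p)p^l)$, which is the statement of Theorem 2.5 [8]. The sketch below rests on stated assumptions: the function names are hypothetical, $\bar{Q}_n$ is computed from the recurrence $\bar{Q}_n = q\bar{Q}_{n-1} + qp\bar{Q}_{n-2} + \cdots + qp^{l-1}\bar{Q}_{n-l}$ used in the proofs above with $\bar{Q}_n = 1$ for $n < l$, and the infinite sum is truncated at an arbitrary cutoff `n_max`.

```python
def no_run_probs(n_max, l, p):
    """qbar[n] = P(no success run of length l within the first n trials)."""
    q = 1 - p
    qbar = [1.0] * min(l, n_max + 1)  # a run of length l cannot fit in fewer than l trials
    for n in range(l, n_max + 1):
        qbar.append(sum(q * p ** (i - 1) * qbar[n - i] for i in range(1, l + 1)))
    return qbar


def mean_waiting_time(l, p, n_max=5000):
    """Truncated sum of n * f_n, with f_l = p**l and f_n = q p**l qbar[n-l-1] for n > l."""
    q = 1 - p
    qbar = no_run_probs(n_max, l, p)
    total = l * p ** l
    for n in range(l + 1, n_max + 1):
        total += n * q * p ** l * qbar[n - l - 1]
    return total


def closed_form(l, p):
    """Feller's mean waiting time for a success run of length l."""
    return (1 - p ** l) / ((1 - p) * p ** l)


print(closed_form(3, 0.5))  # 14.0
print(abs(mean_waiting_time(3, 0.5) - closed_form(3, 0.5)) < 1e-9)  # True
```

For a fair coin and $l = 3$, both routes give a mean waiting time of 14 trials; for small $p$ or large $l$ the cutoff `n_max` would need to be increased for the truncated sum to converge.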
Theorem 2.5 [8]. $E(v) = \dfrac{1 - p^l}{(1 - p)\,p^l}$.
Proof. By definition of the mean, we have
Next we find an expression for
Finally we get that
2.3 Mean and variance of number of occurrences (overlapping allowed)
In this section we will find the mean and variance of the total number of occurrences of $R$ in a string of length $n$. Define the indicator variable $I_j$ by

\[
\begin{aligned}
E\Big[\Big(\sum_{j=l}^{n} I_j\Big)^2\Big]
&= E\Big(\sum_{j=l}^{n} I_j^2\Big) \\
&\quad + 2(n - l) \text{ terms of the form } E(I_j I_{j+1}) \\
&\quad + 2(n - (l + 1)) \text{ terms of the form } E(I_j I_{j+2}) \\
&\quad + \cdots \\
&\quad + 2(n - (l + l - 2)) \text{ terms of the form } E(I_j I_{j+l-1}) \\
&\quad + (n - (2l - 1))(n - (2l - 2)) \text{ terms of the form } E(I_j I_{j+k}),\ k \ge l
\end{aligned}
\]
Note that E(