Trang 11140 Gautam B Singh
Fig 59.2 The processing stages required for generating the Markovian models for DNA patterns
these elements and defines consensus and matrices for elements of certain function, and thus
to provide means of identifying regulatory signals in anonymous genomic sequences
TRANSFAC: Transcription Factors and Regulation
The development of the TRANSFAC database has been guided by the objective of providing a biological context for understanding the function of regulatory signals found in genomic sequences. This compilation of signals aims to provide all relevant data about the regulating proteins and to allow researchers to trace transcriptional control cascades back to their origin (Wingender et al., 2001, Matys et al., 2003). The TRANSFAC database contains information about regulatory DNA sequences and the transcription factors binding to and acting through them. At the core of this database are its components describing the transcription factor (FACTOR), its corresponding binding site (SITE), and the regulation of the corresponding gene (GENE). The GENE table is one of the central tables in this database. It is linked to several other databases, including S/MARtDB (Liebich et al., 2002), TransCOMPEL (Margoulis et al., 2002), LocusLink, OMIM, and RefSeq (Wheeler et al., 2004).
Sites must be experimentally proven to be included in the database. The experimental evidence for the transcription factor and the DNA-binding site is described, and the cell type from which the factor is derived is linked to the respective entry in the CELL table. A set of weight matrices is derived from the collection of binding sites. These matrices are recorded in the MATRIX table. Moreover, the transcription factors are assigned to a class according to their DNA-binding domain, and hence a link to the CLASS table is established. The starting point for accessing these databases is the following web site: www.gene-regulation.com.

As an example, consider the somewhat edited entry from the SITE table of the TRANSFAC database shown in Figure 59.3. The entry provides a wide variety of information about the transcription factor, such as the binding sequence motif (SQ) and the first (SF) and last (ST) positions of the factor binding site. The accession number of the binding factor itself is provided (BF); this is in fact a key for the FACTOR table in TRANSFAC. The source of the factor is identified (SO). The specific types of cells in which the factor was found to be active are identified: 3T3, C2 myoblasts, and F9 in this case. Additional information about these cells is accessible in the CELL table under the accession numbers 0003, 0042 and 0069, respectively. External database references and their corresponding accession numbers are provided under the DR field, as are publication titles (RT) and citation information (RL).
Fig 59.3 A sample record from TRANSFAC database
Another set of patterns is significant for bringing about structural modifications to the DNA. The DNA must be in a structurally open conformation¹ for gene expression to occur successfully. The MARs, or Matrix Attachment Regions, are relatively short (100-1000 bp long) sequences that anchor the chromatin loops to the nuclear matrix and enable it to adopt the open conformation needed for gene expression (Bode, 1996). Approximately 100,000 matrix attachment sites are believed to exist in the mammalian nucleus, a number that roughly equals the number of genes. MARs have been observed to flank the ends of genic domains encompassing various transcriptional units (Bode, 1996, Nikolaev et al., 1996). A list of structural motifs that are responsible for attaching DNA to the nuclear matrix is shown in Table 59.1.
59.2.2 Clustering Biological Patterns
Clustering is an important step, as it directly impacts the success of the downstream model generation process. Given a set of sequence patterns, S, the objective of the clustering process is to partition these into groups such that each group represents patterns that are related through sequence-level, functional, or structural similarity. Pattern similarity at the sequence level can be measured by the string-edit, or Levenshtein, distance. In most cases, sequence-level similarity implies functional and/or structural similarity. However, sometimes known similarity in function (for example, the categorization of the MAR-specific patterns above) may be used to form clusters regardless of the sequence-level similarity.
¹ Within a cell, the DNA can be in a loosely packed, open conformation by adopting an 11 nm fiber structure, or in a tightly packed, closed conformation by adopting a 30 nm fiber structure.

Table 59.1 Polymorphism is commonly observed in biological patterns. A stochastic basis for pattern representation is thus justifiable. The list of motifs that are functionally related to MARs was generated by studying related literature.

m7    Curved DNA Signal    AAAANNNNNNNAAAANNNNNNNAAAA
m8    Curved DNA Signal    TTTTNNNNNNNTTTTNNNNNNNTTTT
m10   Kinked DNA Signal    TANNNTGNNNCA
m11   Kinked DNA Signal    TANNNCANNNTG
m12   Kinked DNA Signal    TGNNNTANNNCA
m13   Kinked DNA Signal    TGNNNCANNNTA
m14   Kinked DNA Signal    CANNNTANNNTG
m15   Kinked DNA Signal    CANNNTGNNNTA
m16   mtopo-II Signal      RNYNNCNNGYNGKTNYNY
m17   dtopo-II Signal      GTNWAYATTNATNNR

Consider a given sequence pair, $\vec{a} = a_1 a_2 \ldots a_n$ and $\vec{b} = b_1 b_2 \ldots b_m$, where both sequences are defined over the alphabet $\mathcal{A} = \{A, C, T, G\}$. Let $d(a_i, b_j)$ denote the distance between the $i$-th symbol of sequence $\vec{a}$ and the $j$-th symbol of sequence $\vec{b}$; $d(a_i, b_j)$ is defined as $|g_i - g_j|$. Also, let $g(k)$ be the cost of inserting (or deleting) an additional gap of size $k$. If the distance between these two pattern sequences of lengths $n$ and $m$ is denoted as $D_{n,m}$, the recursive formulation of Levenshtein's distance is defined by

\[
D_{i,j} = \min
\begin{cases}
D_{i-1,j-1} + d(a_i, b_j), \\
\min_{1 \le k \le j} \{ D_{i,j-k} + g(k) \}, \\
\min_{1 \le l \le i} \{ D_{i-l,j} + g(l) \}
\end{cases}
\tag{59.1}
\]
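The recursion in Eq. (59.1) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `pattern_distance` and `normalized_distance` are hypothetical, the default substitution cost (0 for identical symbols, 1 otherwise) and the linear gap cost $g(k) = k$ are assumptions rather than the costs used in the chapter, and dividing by the longer pattern length is one possible normalization among several.

```python
def pattern_distance(a, b, d=None, g=None):
    """Distance between strings a and b using the recursion of Eq. (59.1)."""
    # Assumed substitution cost: 0 for identical symbols, 1 otherwise.
    if d is None:
        d = lambda x, y: 0.0 if x == y else 1.0
    # Assumed gap cost: a gap of size k costs k.
    if g is None:
        g = lambda k: float(k)

    n, m = len(a), len(b)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Boundary conditions: aligning a prefix against the empty string is one gap.
    for i in range(1, n + 1):
        D[i][0] = g(i)
    for j in range(1, m + 1):
        D[0][j] = g(j)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = D[i - 1][j - 1] + d(a[i - 1], b[j - 1])
            gap_k = min(D[i][j - k] + g(k) for k in range(1, j + 1))
            gap_l = min(D[i - l][j] + g(l) for l in range(1, i + 1))
            D[i][j] = min(substitute, gap_k, gap_l)
    return D[n][m]


def normalized_distance(a, b):
    """Assumed normalization: divide by the length of the longer pattern."""
    return pattern_distance(a, b) / max(len(a), len(b), 1)
```

With the linear gap cost assumed here the result coincides with the ordinary edit distance; a different `g` (for example an affine cost) can be passed in without changing the recursion.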
Having computed the similarity between all pattern pairs, the clustering algorithm described below is applied to group these into pattern-clusters. This clustering approach is based upon the work described in (Zahn, 1971, Page, 1974). In this graph-theoretic approach, each vertex $v_x$ represents a pattern $x \in S$, belonging to the set of patterns being clustered. The normalized Levenshtein distance between two patterns $x$ and $y$, denoted as $\delta_{xy}$, is the weight of the edge $e_{xy}$ connecting vertices $v_x$ and $v_y$. The clustering process proceeds as follows:
• Construct a Minimum Spanning Tree (MST). The MST covers the entire set S of patterns. The MST is built using Prim's algorithm. Since the MST covers the entire set of training sequences, it is considered to be the root-cluster that is iteratively sub-divided into smaller child clusters by repeated application of the two steps below.
• Identify an Inconsistent Edge in the MST. This process is based on the values of the mean $\mu_i$ and standard deviation $\sigma_i$ of the distance values for edges in a cluster $C_i$. The cluster and edge with the largest z-score, $z_{max}$, are identified. The variable $e^{i}_{jk}$ denotes the weight of an edge in cluster $C_i$.

\[
z_{max} = \max_{i} \max_{j,k \in C_i} \frac{e^{i}_{jk} - \mu_i}{\sigma_i}
\tag{59.2}
\]
• Remove the Inconsistent Edge. The edge identified in the previous step is further subject to the condition that its z-score be larger than a pre-specified threshold. If this condition is satisfied, the inconsistent edge is removed, causing the cluster containing the inconsistent edge to be split into two child clusters². However, if the edge's z-score falls below the threshold, the iterative subdivision process halts. It may be noted that the threshold for inconsistent edge removal is often specified in terms of $\sigma$.

² Removing any edge of a tree (the MST in this case) causes the tree to be split into two trees (into two MSTs in this case).
The categorization of the DNA pattern sequences into appropriate clusters is essential for training the pattern models as described in the following section. The quality of each cluster will be assessed to ensure that sufficient examples exist in the cluster to train a stochastic model. In the absence of sufficient examples, the patterns will be represented as boolean decision trees.
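The MST-based clustering steps of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the chapter's implementation: the function names (`prim_mst`, `component_of`, `mst_clusters`) are hypothetical, the input is assumed to be a symmetric matrix of normalized pattern distances, the population standard deviation is used for $\sigma_i$, and clusters with fewer than two edges or with zero spread are simply never split.

```python
import statistics


def prim_mst(dist):
    """Prim's algorithm on a complete graph; returns edges as (weight, u, v)."""
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        w, u, v = min((dist[u][v], u, v)
                      for u in in_tree for v in range(n) if v not in in_tree)
        edges.append((w, u, v))
        in_tree.add(v)
    return edges


def component_of(start, edges):
    """Vertices reachable from start through the given edges."""
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for _, u, v in edges:
            for p, q in ((u, v), (v, u)):
                if p == x and q not in seen:
                    seen.add(q)
                    stack.append(q)
    return seen


def mst_clusters(dist, z_threshold=2.0):
    """Split the root cluster by repeatedly removing the most inconsistent edge."""
    clusters = [(set(range(len(dist))), prim_mst(dist))]  # root cluster
    while True:
        best = None  # (z-score, cluster index, edge)
        for ci, (_, edges) in enumerate(clusters):
            weights = [w for w, _, _ in edges]
            if len(weights) < 2:
                continue
            mu = statistics.mean(weights)
            sigma = statistics.pstdev(weights)
            if sigma == 0:
                continue
            for e in edges:
                z = (e[0] - mu) / sigma
                if best is None or z > best[0]:
                    best = (z, ci, e)
        # Halt when no edge exceeds the pre-specified threshold.
        if best is None or best[0] <= z_threshold:
            break
        _, ci, edge = best
        verts, edges = clusters.pop(ci)
        remaining = [e for e in edges if e != edge]
        left = component_of(edge[1], remaining)
        for part in (left, verts - left):
            inside = [e for e in remaining if e[1] in part and e[2] in part]
            clusters.append((part, inside))
    return [sorted(v) for v, _ in clusters]
```

Note that with only a handful of edges in a cluster the largest attainable z-score is small, so the threshold would need to be chosen with the cluster size in mind.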
59.2.3 Learning Cluster Models
A DNA sequence matrix is a set of fixed-length DNA sequence segments aligned with respect to an experimentally determined, biologically significant site. The columns of a DNA sequence matrix are numbered with respect to the biological site, usually starting with a negative number. A DNA sequence motif can be defined as a matrix of depth 4 utilizing a cut-off value. The 4-column/mononucleotide matrix description of a genetic signal is based on the assumptions that the motif is of fixed length and that each nucleotide is independently recognized by a trans-acting mechanism. For example, the following frequency matrix has been reported for the TATAA box.
Table 59.2 Weight Matrix for TATA Box
Given a set of aligned signal sequences of length $L$ corresponding to the functional signal under consideration, $F = [f_{bi}]$, $(b \in \Sigma)$, $(i = 1, \ldots, L)$ is the nucleotide frequency matrix, where $f_{bi}$ is the absolute frequency of occurrence of the $b$-th type of nucleotide out of the set $\Sigma = \{A, C, G, T\}$ at the $i$-th position along the functional site.
The frequency matrix may be utilized for developing an un-gapped score model when searching for the sites in a sequence. Typically, a log-odds scoring scheme is utilized for searching for a pattern $x$ of length $L$, as shown in Eq. (59.3). The quantity $e_i(b)$ specifies the probability of observing the base $b$ at position $i$ and is defined using a frequency matrix such as the one shown above. The quantity $q(b)$ represents the background probability for the base $b$.

\[
S = \sum_{i=1}^{L} \log \frac{e_i(x_i)}{q(x_i)}
\tag{59.3}
\]

The elements $\log \frac{e_i(x_i)}{q(x_i)}$ behave like a scoring matrix similar to the PAM and BLOSUM matrices. The term Position Specific Scoring Matrix (PSSM) is often used to describe pattern search with such a matrix. A PSSM can be used to search for a match in a longer sequence of length $N$ by evaluating a score $S_j$ for each starting point $j$ in the sequence, from position 1 to $(N - L + 1)$, where $L$ is the length of the PSSM. These optimized weight matrices can be used to search for functional signals in nucleotide sequences. Any nucleotide fragment of length $L$ is analyzed and tested for assignment to the proper functional signal. A matching score of $\sum_{i=1}^{L} W(b_i, i)$ is assigned to the nucleotide position being examined along the sequence. In the search formulation, $b_i$ is the base at position $i$ along the biological sequence, and $W(b_i, i)$ represents the corresponding weight matrix entry for symbol $b_i$ occurring at position $i$ along the motif. A more detailed example of learning the PSSM for a pattern cluster is shown in Figure 59.4(a).
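As a rough illustration of the log-odds scoring described above, the following sketch builds a PSSM from a block of aligned sites and slides it along a longer sequence. The names `build_pssm` and `scan`, the uniform background $q(b) = 0.25$, and the pseudocount of 1 (added so that unseen bases do not produce $\log 0$) are assumptions made for this example only; the sketch also assumes that all input consists solely of the symbols A, C, G and T.

```python
import math

BASES = "ACGT"


def build_pssm(aligned, background=None, pseudocount=1.0):
    """Return one dict per column mapping base b to log(e_i(b) / q(b))."""
    if background is None:
        background = {b: 0.25 for b in BASES}  # assumed uniform background q(b)
    length = len(aligned[0])
    pssm = []
    for i in range(length):
        # Column frequencies f_bi, smoothed with an assumed pseudocount.
        counts = {b: pseudocount for b in BASES}
        for seq in aligned:
            counts[seq[i]] += 1
        total = sum(counts.values())
        pssm.append({b: math.log((counts[b] / total) / background[b])
                     for b in BASES})
    return pssm


def scan(sequence, pssm):
    """Score S_j for every start position j (0-based) of a window of length L."""
    L = len(pssm)
    return [sum(pssm[i][sequence[j + i]] for i in range(L))
            for j in range(len(sequence) - L + 1)]
```

For example, calling `scan("GGCTATAAAGGC", build_pssm(["TATAAA", "TATAAA", "TATATA", "TATAAT"]))` returns one score per window, and the highest score falls on the window that starts at the embedded TATAAA.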
A stochastic extension of the PSSM is based on a Markovian representation of biological sequence patterns. As a first step toward learning the pattern-HMM, one of the two common HMM architectures must be selected to define the topology. These are the fully connected ergodic architecture and the Left-Right (LR) architecture. The fully connected architecture offers a higher-level modeling capability, but generally requires a larger set of training data. The left-right configuration, on the other hand, is powerful enough for modeling sequences, does not require a large training set, and facilitates model comprehension. The moderate amount of available training data often dictates that the LR-HMM be utilized for representing the pattern clusters.
The initial parameters for the pattern-HMM are assigned heuristically. The number of states, $N$, with the states denoted as $S = \{S_1, S_2, \ldots, S_N\}$, in a pattern-HMM may be set to as large a value as the total number of DNA symbols in the longest pattern in that cluster. A smaller number of states is heuristically chosen in practice. With each state $S_i$, an emission probability vector corresponding to the emission of each of the symbols $\{A, C, T, G\}$ is associated. The process of generating each pattern is sequential, such that the $x$-th symbol generated, $D_x$, is a result of the HMM being in a hidden state $q_x = S_i$. The parameters of the HMM are denoted as $\lambda = \{A, B, \pi\}$ and defined as follows (Rabiner, 1989).
A: The $N \times N$ matrix $A = \{a_{ij}\}$ representing the state transition probabilities.

\[
a_{ij} = \Pr[q_{x+1} = S_j \mid q_x = S_i], \quad 1 \le i, j \le N
\tag{59.4}
\]

B: The $N \times k$ state-dependent observation symbol probability matrix for each base $n \in \{A, C, T, G\}$. The elements of this matrix, $B = \{b_j(n)\}$, are defined as follows:

\[
b_j(n) = \Pr[D_x = n \mid q_x = S_j], \quad 1 \le j \le N,\ 1 \le d \le k
\tag{59.5}
\]

π: The initial state distribution probabilities, $\pi = \{\pi_i\}$.
The maximum likelihood estimation procedure of Baum and Welch is next utilized for training each pattern-HMM, such that the pattern sequences in a cluster are the most likely set of samples generated by the underlying HMM. Figure 59.4(b) represents the training methodology applied for learning the HMM parameters based on the local alignment block used for training the PSSM in Figure 59.4(a).
Thus, pattern HMMs may be associated with clusters where the number of instances is large enough to allow their parameters to be learned adequately. Smaller clusters will be represented as PSSMs, profiles, or regular expressions. Profiles are similar to PSSMs (Gribskov et al., 1990) and are generated using the sequences in a cluster when the alignment between the members of the cluster is strong. Regular expressions constitute the method of choice for smaller groups of shorter patterns, where compositional statistics are hard to evaluate.
59.3 Searching for Meta-Patterns
The process of discovering hierarchical pattern associations is posed in terms of the relationships between models of a family of patterns, rather than between individual patterns. This will enable us to validate the meta-pattern hypotheses in a computationally tractable manner.

The patterns are associated when they occur within a specific distance of each other, called their association interval. The association interval will be established using a split-and-merge procedure. Using a default association interval of 1000 bp, the overall significance of patterns found by splitting this interval is assessed. Additionally, the neighboring windows are merged to assess the statistical significance of larger regions. In this manner, the region with the highest level of significance is considered as the association interval for a group of patterns.

Fig 59.4 (a) A PSSM based model induced from a Multiple Sequence Alignment (b) A HMM induced from the same alignment
A statistical significance is associated with each pattern-model pair detected within an association interval. This is achieved through two levels of searching GenBank³. The Level I search yields the regions that exhibit a high concentration of patterns. This is the first step toward generating pattern association hypotheses that are biologically significant, as patterns working in coordination are generally expected to be localized close to each other. The Level II search aims at building support and confidence, so that Level I hypotheses may be accepted or rejected based on pre-specified criteria. Consider, for example, two patterns A and B whose pattern HMMs are strongly correlated in the Level I search. The Level II search may nevertheless reveal a substantially large number of instances outside the high pattern density regions where the two patterns occur independently of each other. This will lead to the rejection of the $A \equiv B$ hypothesis.
59.3.1 Level I Search: Locating High Pattern Density Region
The search for High Pattern Density Regions, or HPDRs, aims at isolating the regions of the DNA sequence where the patterns modeled by the HMMs occur at a density that is higher than expected. The Level I search is aimed at identifying HPDRs, as shown in Figure 59.5. These regions may be located by measuring the significance of patterns detected in a window of size $W$ located at a given position on the sequence. A numerical value for the pattern density at location $x$ on the DNA sequence is obtained by treating the pattern occurrences within a window centered at location $x$ as trials from independent Poisson processes. The null hypothesis, $H_0$, tested in each window is essentially that the pattern frequencies observed in the window are no different from those expected in a random sequence.

A large deviation from the expected frequency of patterns in a window forces the rejection of $H_0$. The level of confidence with which $H_0$ is rejected is used to assign a statistical pattern-density metric to the window. Specifically, the pattern density in a window is defined to be $\rho = -\log(p)$, where $p$ is the probability of erroneously rejecting $H_0$. As a matter of detail, it may be noted that the value of $\rho$ is computed for both the forward and the reverse DNA strands, and the average of the two is taken to be the true density estimate for that location.
Fig 59.5 High Pattern Density Regions or HPDRs are detected by statistical means for all sequences in the database
³ GenBank is the database of DNA sequences that is publicly accessible from the National Institutes of Health, Bethesda, MD, USA.
In order to compute $\rho$, assume that we are searching for $k$ distinct types of patterns within a given window of the sequence. In general, these patterns are defined as rules $R_1, R_2, \ldots, R_k$. The probability of random occurrence of each of the $k$ patterns is calculated using the AND-OR relationships between the individual motifs. Assume that these probabilities for the $k$ patterns are $p_1, p_2, \ldots, p_k$. Next, a random vector of pattern frequencies, $F$, is constructed. $F$ is a $k$-dimensional vector with components $F = \{x_1, x_2, \ldots, x_k\}$, where each component $x_i$ is a random variable representing the frequency of the pattern $R_i$ in the $W$ base-pair window. The component random variables $x_i$ are assumed to be independently distributed Poisson processes, each with the parameter $\lambda_i = p_i \cdot W$. Thus, the joint probability of observing a frequency vector $F_{obs} = \{f_1, f_2, \ldots, f_k\}$ purely by chance is given by:

\[
P(F_{obs}) = \prod_{i=1}^{k} \frac{e^{-\lambda_i} \lambda_i^{f_i}}{f_i!}
\]
The steps required for the computation of $\alpha$, the cumulative probability that pattern frequencies equal to or greater than the vector $F_{obs}$ occur purely by chance, are given by Eq. (59.8) below. This corresponds to the one-sided integral of the multivariate Poisson distribution and represents the probability that $H_0$ is erroneously rejected.

\[
\begin{aligned}
\alpha &= \Pr(x_1 \ge f_1, x_2 \ge f_2, \ldots, x_k \ge f_k) \\
&= \Pr(x_1 \ge f_1) \cdot \Pr(x_2 \ge f_2) \cdots \Pr(x_k \ge f_k) \\
&= \sum_{x_1 = f_1}^{\infty} \frac{e^{-\lambda_1} \lambda_1^{x_1}}{x_1!} \cdot \sum_{x_2 = f_2}^{\infty} \frac{e^{-\lambda_2} \lambda_2^{x_2}}{x_2!} \cdots \sum_{x_k = f_k}^{\infty} \frac{e^{-\lambda_k} \lambda_k^{x_k}}{x_k!}
\end{aligned}
\tag{59.8}
\]
The p-value, $\alpha$, in Eq. (59.8) is utilized to compute the value of $\rho$, or the cluster-density, as specified in Eq. (59.9) below:

\[
\begin{aligned}
\rho &= \ln\frac{1.0}{\alpha} = -\ln(\alpha) \\
&= \sum_{i=1}^{k} \lambda_i + \sum_{i=1}^{k} \ln f_i! - \sum_{i=1}^{k} f_i \ln \lambda_i - \sum_{i=1}^{k} \ln\left(1 + \frac{\lambda_i}{f_i + 1} + \cdots + \frac{\lambda_i^{t}}{(f_i + 1)(f_i + 2) \cdots (f_i + t)} + \cdots\right)
\end{aligned}
\tag{59.9}
\]

The infinite summation term in Eq. (59.9) converges quickly and thus can be adaptively calculated to the precision desired. For small values of $\lambda_i$, the series may be truncated such that the last term is smaller than an arbitrarily small constant, $\varepsilon$.
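The computation of $\rho$ in Eq. (59.9) can be sketched as below. The function names `poisson_tail_neg_log` and `pattern_density` are hypothetical, the default $\varepsilon = 10^{-12}$ is an arbitrary choice, and every $\lambda_i$ is assumed to be strictly positive and small (as stated above), since for large $\lambda_i$ the series terms grow considerably before they start to shrink.

```python
import math


def poisson_tail_neg_log(f, lam, eps=1e-12):
    """-ln Pr(x >= f) for x ~ Poisson(lam), one summand group of Eq. (59.9)."""
    # Series 1 + lam/(f+1) + lam^2/((f+1)(f+2)) + ..., truncated once the
    # current term drops below eps.
    series, term, t = 1.0, 1.0, 0
    while True:
        t += 1
        term *= lam / (f + t)
        if term < eps:
            break
        series += term
    # lam + ln f! - f ln lam - ln(series); lgamma(f + 1) equals ln f!.
    return lam + math.lgamma(f + 1) - f * math.log(lam) - math.log(series)


def pattern_density(freqs, probs, window):
    """rho for observed frequencies f_i, pattern probabilities p_i, window W."""
    return sum(poisson_tail_neg_log(f, p * window)
               for f, p in zip(freqs, probs))
```

Averaging the value over the forward and reverse strands, as described earlier, would simply mean calling `pattern_density` once per strand and taking the mean of the two results.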
Fig 59.6 The analysis of human protamine gene cluster using the MAR-Finder tool Default analysis parameters were used
Figure 59.6 presents the output from the analysis of the human protamine gene sequence. This statistical inference algorithm, based on the association of patterns found in close proximity within a DNA sequence region, has been incorporated in the MAR-Finder tool. A Java-enabled version of the tool described in (Singh et al., 1997) is also available for public access from http://www.MarFinder.com
We also need to take into consideration the interdependence of pattern occurrences. Let $f_{ij}$ correspond to the observed frequency of the pattern defined by pattern-HMM $H_j$ in the $i$-th window sample. Using the frequency data from $n$ window samples and the mean frequencies $\bar{f}_i$, the correlation matrix $R = (r_{ij})$ can be evaluated as follows:

\[
r_{ij} = \frac{s_{ij}}{s_i s_j} = \frac{\sum_{r=1}^{n} (f_{ri} - \bar{f}_i)(f_{rj} - \bar{f}_j)}{\sqrt{\sum_{r=1}^{n} (f_{ri} - \bar{f}_i)^2}\ \sqrt{\sum_{r=1}^{n} (f_{rj} - \bar{f}_j)^2}}
\tag{59.10}
\]
If the sample correlation matrix, $R$, is equal to the identity matrix, the variables can be considered to be uncorrelated or independent. The hypothesis $r_{ij} = 0$ can be tested using the statistic $t_{ij}$ defined in Eq. (59.11). $t_{ij}$ follows a Student's t-distribution with $(n - 2)$ degrees of freedom (Kachigan, 1986).

\[
t_{ij} = \frac{r_{ij} \sqrt{n - 2}}{\sqrt{1 - r_{ij}^2}}
\tag{59.11}
\]
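Eqs. (59.10) and (59.11) translate directly into code. In the sketch below, `samples` is assumed to be a list of $n$ rows, one per window, with one observed frequency per pattern in each row; the function names are hypothetical. The sketch does not guard against a pattern whose frequency is constant across windows, or against $|r_{ij}| = 1$, both of which would cause a division by zero.

```python
import math


def correlation(samples, i, j):
    """Sample correlation r_ij between pattern columns i and j (Eq. 59.10)."""
    n = len(samples)
    mean_i = sum(row[i] for row in samples) / n
    mean_j = sum(row[j] for row in samples) / n
    s_ij = sum((row[i] - mean_i) * (row[j] - mean_j) for row in samples)
    s_ii = sum((row[i] - mean_i) ** 2 for row in samples)
    s_jj = sum((row[j] - mean_j) ** 2 for row in samples)
    return s_ij / math.sqrt(s_ii * s_jj)


def t_statistic(r, n):
    """Test statistic of Eq. (59.11), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
```

The resulting statistic would then be compared against a critical value of the t-distribution with $n - 2$ degrees of freedom at the chosen significance level.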
If pattern interdependence is detected, the pairwise correlation terms in $R$ can be used to remove surrogate variables, i.e. one of the two patterns that exhibit a high degree of correlation. Removal of surrogate variables results in retaining a core subset of the original patterns that accounts for the variability of the observed data (Hair et al., 1987). Let there be $k$ such core patterns that are retained for the subsequent analysis stage. If the pairwise correlation terms of $R_k$ are non-zero, the Mahalanobis Transformation can be applied to the vector $\vec{f}_k$ to transform it to a vector $\vec{z}_k$. The property of such a transformation is that the correlation matrix of the transformed variables is guaranteed to be the identity matrix $I$ (Mardia et al., 1979). The Mahalanobis Transformation for obtaining the uncorrelated vector $\vec{z}_k$ from the observed frequency vector of core patterns $\vec{f}_k$ is specified in Eq. (59.12), with the $l_i$ denoting the eigenvalues.

\[
\vec{z}_k = S_k^{-1/2} \left( \vec{f}_k - \bar{\vec{f}}_k \right), \qquad S_k^{-1/2} = \Gamma \Lambda^{-1/2} \Gamma^{T}
\tag{59.12}
\]

where $\vec{f}_k$ is the observed frequency vector and $\Lambda^{-1/2} = \mathrm{diag}(l_i^{-1/2})$.

The value for $\alpha$ can next be computed based on the transformed vector $\vec{z}_k$, as shown in Eq. (59.13). The components of the transformed vector are independent, and thus the multiplication of individual probability terms is justifiable. Each component, $z_i$, represents a linear combination of the observed frequency values.
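\[
\begin{aligned}
\alpha &= \Pr(z_1 \ge z_{f_1}, z_2 \ge z_{f_2}, \ldots, z_c \ge z_{f_c}) \\
&= \Pr(z_1 \ge z_{f_1}) \cdot \Pr(z_2 \ge z_{f_2}) \cdots \Pr(z_c \ge z_{f_c}) \\
&= \sum_{z_1 = \lceil z_{f_1} \rceil}^{\infty} \frac{e^{-1}}{z_1!} \cdot \sum_{z_2 = \lceil z_{f_2} \rceil}^{\infty} \frac{e^{-1}}{z_2!} \cdots \sum_{z_c = \lceil z_{f_c} \rceil}^{\infty} \frac{e^{-1}}{z_c!}
\end{aligned}
\tag{59.13}
\]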
59.3.2 Level II Search: Meta-Pattern Hypotheses
The meta-pattern, or higher-level pattern, hypotheses are generated and tested within the HPDRs. Specifically, the Pattern Association (PA) hypotheses are generated and verified within these HPDRs. These PA hypotheses are built in a bottom-up manner from the validation of pairwise associations. For example, for two patterns A and B, a PA-hypothesis that we might validate is $A \to B$, with the usual semantics that the occurrence of pattern A implies the occurrence of pattern B within a pre-specified association distance. Furthermore, if the PA-hypothesis stating that $B \to A$ is also validated, the relationship between patterns A and B is promoted to that of Pattern Equivalence (PE), denoted as $A \leftrightarrow B$ or $A \equiv B$. Transitivity can be used to build larger groups of associations, such that if $A \to B$ and $B \to C$, then the implication $A \to BC$ may be concluded.
A similar statement can be made about the PE-hypotheses⁴. Meta-pattern formation using transitivity rules will lead to the discovery of mosaic-type meta-patterns. For the purpose of developing a methodology for systematically generating PA-hypotheses, the DNA sequence is represented as a sequence of 2-tuples. The first element of each tuple is the pattern match location, and the second element identifies the specific HMM(s) that matched. (It is possible for more than one pattern model to match the DNA sequence at a specific location.) Such a representation, shown in Eq. (59.14) and denoted as $F_S$, is the pattern-sequence corresponding to the biological sequence $S$.
\[
F_S = (x_1, P_a), (x_2, P_b), \ldots, (x_i, P_r), \ldots, (x_n, P_v)
\tag{59.14}
\]

Eq. (59.15) specifies the set of pattern hypotheses generated within each HPDR. The operator cadr(L) is used to denote car(cdr(L)). The equation specifies that unique hypotheses are formed by considering the closest pattern $P_y$ instance to a given pattern $P_x$ instance.

\[
\begin{aligned}
H_{AB} = \{ (A, B) \mid\ & A = \mathrm{cadr}(P_x) \wedge B = \mathrm{cadr}(P_y) \ \wedge \\
& \Delta_{A,B} = \| \mathrm{car}(P_x) - \mathrm{car}(P_y) \| \wedge (\Delta_{A,B} < \theta) \ \wedge \\
& (\neg \exists P_z)( \| \mathrm{car}(P_x) - \mathrm{car}(P_z) \| < \Delta_{A,B} ) \}
\end{aligned}
\tag{59.15}
\]
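A possible reading of Eq. (59.15) in code is given below. It treats the pattern-sequence $F_S$ as a list of `(location, label)` tuples, so that car corresponds to the location and cadr to the pattern label. The function name `pa_hypotheses`, the use of the absolute difference of locations as the distance, and the decision to keep every equally close neighbor when there is a tie are assumptions made for this sketch.

```python
def pa_hypotheses(pattern_sequence, theta):
    """Unique pairs (A, B): B labels the closest match to an instance of A,
    provided the distance between the two matches is below theta."""
    hypotheses = set()
    for ix, (pos_x, label_x) in enumerate(pattern_sequence):
        # Distances from this instance to every other instance.
        others = [(abs(pos_x - pos_y), label_y)
                  for iy, (pos_y, label_y) in enumerate(pattern_sequence)
                  if iy != ix]
        if not others:
            continue
        nearest = min(dist for dist, _ in others)
        if nearest < theta:
            # No instance P_z lies closer than the nearest one found.
            hypotheses.update((label_x, label_y)
                              for dist, label_y in others if dist == nearest)
    return hypotheses
```

For the pattern-sequence `[(100, "A"), (130, "B"), (900, "C")]` with `theta=50`, this yields both `("A", "B")` and `("B", "A")`, the situation in which the pair would be a candidate for promotion to a PE-hypothesis.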
An $N \times N$ matrix $C$, similar to a contingency table (Gokhale, 1978, Brien, 1989), is used for recording the significance of each PA-hypothesis generated from the analysis of all HPDRs in the entire set of sequences. Recall that these regions were identified during the Level I search. The score for cell $C_{A,B}$ is updated according to Eq. (59.16) for every pattern-pair hypothesis $H_{AB}$ generated in these regions. The probabilities of random occurrence of patterns A and B are $p_A$ and $p_B$, respectively, and $\Delta_{A,B}$ is the distance between them.

\[
\begin{aligned}
C_{AB} &= C_{AB} + \rho_{AB} \\
&= C_{AB} + (\lambda_A + \lambda_B - \ln \lambda_A - \ln \lambda_B) \\
&\text{where } \lambda_A = p_A \Delta_{AB} \text{ and } \lambda_B = p_B \Delta_{AB}
\end{aligned}
\tag{59.16}
\]
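The update rule of Eq. (59.16) is small enough to show in full. In this sketch the matrix $C$ is stored as a dictionary keyed by the ordered pattern pair instead of an $N \times N$ array, which is purely a convenience; the function name `update_contingency` is hypothetical, and the distance between the two matches is assumed to be strictly positive so that the logarithms are defined.

```python
import math
from collections import defaultdict


def update_contingency(C, a, b, p_a, p_b, delta):
    """Add rho_AB to cell C[(a, b)] for one hypothesis H_AB (Eq. 59.16)."""
    lam_a = p_a * delta  # lambda_A = p_A * Delta_AB
    lam_b = p_b * delta  # lambda_B = p_B * Delta_AB
    C[(a, b)] += lam_a + lam_b - math.log(lam_a) - math.log(lam_b)
    return C
```

A typical use would start from `C = defaultdict(float)` and call `update_contingency` once for every pair produced within each HPDR, so that pairs that are repeatedly found close together accumulate a large score.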
An information-theoretic approach based on mutual information content is next utilized for characterizing the strength of association between pattern pairs. The contents of the contingency table (after all the sequences in the database have been processed) need to be converted to correspond to
⁴ The functional significance of $A \to BC$ is that protein binding to site A will lead to the binding of proteins at sites B and C. For a meta-pattern of the form $A \leftrightarrow B$, both proteins must bind simultaneously to bring forth the necessary function.