Data Mining and Knowledge Discovery Handbook, 2nd Edition, part 117


1140 Gautam B Singh

Fig 59.2 The processing stages required for generating the Markovian models for DNA patterns

these elements and defines consensus and matrices for elements of certain function, and thus to provide means of identifying regulatory signals in anonymous genomic sequences.

TRANSFAC: Transcription Factors and Regulation

The development of the TRANSFAC database has been guided by the objective of providing a biological context for understanding the function of regulatory signals found in genomic sequences. The aim of this compilation of signals was to provide all relevant data about the regulating proteins and to allow researchers to trace transcriptional control cascades back to their origin (Wingender et al., 2001; Matys et al., 2003). The TRANSFAC database contains information about regulatory DNA sequences and the transcription factors binding to and acting through them. At the core of this database are its components describing the transcription factor (FACTOR), its corresponding binding site (SITE), and the regulation of the corresponding gene (GENE). The GENE table is one of the central tables in this database. It is linked to several other databases including S/MARtDB (Liebich et al., 2002), TransCOMPEL (Margoulis et al., 2002), LocusLink, OMIM, and RefSeq (Wheeler et al., 2004).

Sites must be experimentally proven to be included in the database. The experimental evidence for the transcription factor and the DNA-binding site is described, and the cell type from which the factor is derived is linked to the respective entry in the CELL table. A set of weight matrices is derived from the collection of binding sites. These matrices are recorded in the MATRIX table. Moreover, the transcription factors are assigned to a class according to their DNA-binding domain, and hence a link to the CLASS table is established. The starting point for accessing these databases is the following web site: www.gene-regulation.com

As an example, consider the somewhat edited entry from the SITE table of the TRANSFAC database shown in Figure 59.3. The entry provides a wide variety of information about the transcription factor, such as the binding sequence motif (SQ) and the first (SF) and the last position (ST) of the factor binding site. The accession number of the binding factor itself is provided (BF); this is in fact a key for the FACTOR table in TRANSFAC. The source of the factor is identified (SO). The specific types of cells where the factor was found to be active are identified: 3T3, C2 myoblasts, and F9 in this case. Additional information about these cells is accessible in the CELL table under the accession numbers 0003, 0042 and 0069 respectively. External database references and their corresponding accession numbers are provided under the (DR) field, as well as publication titles (RT) and citation information (RL).


Fig 59.3 A sample record from TRANSFAC database

Another set of patterns is significant for bringing about structural modifications to the DNA. The DNA must be in a structurally open conformation¹ for gene expression to occur successfully. The MARs, or Matrix Attachment Regions, are relatively short (100-1000 bp long) sequences that anchor the chromatin loops to the nuclear matrix and enable the chromatin to adopt the open conformation needed for gene expression (Bode, 1996). Approximately 100,000 matrix attachment sites are believed to exist in the mammalian nucleus, a number that roughly equals the number of genes. MARs have been observed to flank the ends of genic domains encompassing various transcriptional units (Bode, 1996; Nikolaev et al., 1996). A list of structural motifs that are responsible for attaching DNA to the nuclear matrix is shown in Table 59.1.

59.2.2 Clustering Biological Patterns

Clustering is an important step as it directly impacts the success of the downstream model generation process. Given a set of sequence patterns, S, the objective of the clustering process is to partition these into groups such that each group represents patterns that are related by sequence-level, functional or structural similarity. Pattern similarity at the sequence level can be measured by the string-edit or Levenshtein's distance. In most cases, sequence-level similarity implies functional and/or structural similarity. However, sometimes known similarity in function (for example, the categorization of MAR-specific patterns above) may be used to form clusters regardless of the sequence-level similarity.

¹ Within a cell, the DNA can be in a loosely packed, open conformation by adopting an 11 nm fiber structure, or in a tightly packed, closed conformation by adopting a 30 nm fiber structure.

Table 59.1 Polymorphism is commonly observed in biological patterns. A stochastic basis for pattern representation is thus justifiable. The list of motifs that are functionally related to MARs was generated by studying related literature.

Index   Motif                DNA signature
m7      Curved DNA Signal    AAAANNNNNNNAAAANNNNNNNAAAA
m8      Curved DNA Signal    TTTTNNNNNNNTTTTNNNNNNNTTTT
m10     Kinked DNA Signal    TANNNTGNNNCA
m11     Kinked DNA Signal    TANNNCANNNTG
m12     Kinked DNA Signal    TGNNNTANNNCA
m13     Kinked DNA Signal    TGNNNCANNNTA
m14     Kinked DNA Signal    CANNNTANNNTG
m15     Kinked DNA Signal    CANNNTGNNNTA
m16     mtopo-II Signal      RNYNNCNNGYNGKTNYNY
m17     dtopo-II Signal      GTNWAYATTNATNNR

Consider a given sequence pair, $\vec{a} = a_1 a_2 \ldots a_n$ and $\vec{b} = b_1 b_2 \ldots b_m$, where both sequences are defined over the alphabet $A = \{A, C, T, G\}$. Let $d(a_i, b_j)$ denote the distance between the $i$-th symbol of sequence $\vec{a}$ and the $j$-th symbol of sequence $\vec{b}$; $d(a_i, b_j)$ is defined as $|g_i - g_j|$. Also, let $g(k)$ be the cost of inserting (or deleting) an additional gap of size $k$. If the distance between these two pattern sequences of lengths $n$ and $m$ is denoted as $D_{n,m}$, the recursive formulation of Levenshtein's distance is defined by

\[
D_{i,j} = \min \left\{
\begin{array}{l}
D_{i-1,j-1} + d(a_i, b_j), \\
\min_{1 \le k \le j} \{ D_{i,j-k} + g(k) \}, \\
\min_{1 \le l \le i} \{ D_{i-l,j} + g(l) \}
\end{array}
\right.
\tag{59.1}
\]
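The recursion in Eq. (59.1) can be sketched as a small dynamic program. This is a minimal illustration, not the chapter's implementation: the function name `edit_distance`, the unit mismatch cost used for $d$, and the linear gap cost $g(k) = k$ are assumptions chosen so that the result coincides with the classical string-edit distance.

```python
def edit_distance(a, b, d=None, g=None):
    """Distance D[n][m] between sequences a and b following Eq. (59.1)."""
    if d is None:
        d = lambda x, y: 0 if x == y else 1  # assumed substitution cost
    if g is None:
        g = lambda k: k                      # assumed linear gap cost
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix against the empty sequence costs one gap of that size.
    for i in range(1, n + 1):
        D[i][0] = g(i)
    for j in range(1, m + 1):
        D[0][j] = g(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = D[i - 1][j - 1] + d(a[i - 1], b[j - 1])
            for k in range(1, j + 1):        # gap of size k in sequence a
                best = min(best, D[i][j - k] + g(k))
            for l in range(1, i + 1):        # gap of size l in sequence b
                best = min(best, D[i - l][j] + g(l))
            D[i][j] = best
    return D[n][m]
```

With these default costs the inner minimizations over $k$ and $l$ never beat single-symbol gaps, so the sketch runs slower than the usual quadratic algorithm; they matter only when a non-linear $g(k)$ is supplied.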

Having computed the similarity between all pattern pairs, the clustering algorithm described below is applied for grouping these into pattern-clusters. This clustering approach is based upon the work described in (Zahn, 1971; Page, 1974). In this graph-theoretic approach, each vertex $v_x$ represents a pattern $x \in S$, belonging to the set of patterns being clustered. The normalized Levenshtein's distance between two patterns $x$ and $y$, denoted as $\delta_{xy}$, is the weight of the edge $e_{xy}$ connecting vertices $v_x$ and $v_y$. The clustering process proceeds as follows:

• Construct a Minimum Spanning Tree (MST): The MST covers the entire set S of patterns. The MST is built using Prim's algorithm. Since the MST covers the entire set of training sequences, it is considered to be the root-cluster that is iteratively sub-divided into smaller child clusters by repeated applications of the two steps below.

• Identify an Inconsistent Edge in the MST: This process is based on the value of the mean $\mu_i$ and standard deviation $\sigma_i$ of the distance values for edges in a cluster $C_i$. The cluster and edge with the largest z-score, $z_{\max}$, is identified. The variable $e^{i}_{jk}$ denotes the weight of an edge in this cluster $C_i$.

\[
z_{\max} = \max_{i} \; \max_{j,k \in C_i} \; \frac{e^{i}_{jk} - \mu_i}{\sigma_i}
\tag{59.2}
\]

• Remove the Inconsistent Edge: The edge identified in the step above is further subject to the condition that its z-score be larger than a pre-specified threshold. If this condition is satisfied, the inconsistent edge is removed, causing the cluster containing the inconsistent edge to be split into two child clusters². However, if the edge's z-score falls below the threshold, the iterative subdivision process halts. It may be noted that the threshold for inconsistent edge removal is often specified in terms of σ.

² Removing any edge of a tree (the MST in this case) causes the tree to be split into two trees (into two MSTs in this case).
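The three steps above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the chapter's code: the names `prim_mst` and `mst_clusters` are hypothetical, the input is assumed to be a symmetric matrix of pairwise distances, the population standard deviation is used for $\sigma_i$, and clusters with fewer than two edges or with zero variance are simply never split.

```python
import statistics

def prim_mst(dist):
    """Prim's algorithm: MST edges (u, v, w) of the complete graph on dist."""
    n = len(dist)
    best = {v: (dist[0][v], 0) for v in range(1, n)}  # cheapest link to tree
    edges = []
    while best:
        v = min(best, key=lambda x: best[x][0])
        w, u = best.pop(v)
        edges.append((u, v, w))
        for x in best:
            if dist[v][x] < best[x][0]:
                best[x] = (dist[v][x], v)
    return edges

def mst_clusters(dist, z_threshold=2.0):
    """Split the MST at inconsistent edges until no z-score exceeds the threshold."""
    clusters = [(set(range(len(dist))), prim_mst(dist))]  # root cluster
    while True:
        z_max, target = z_threshold, None
        for idx, (_, edges) in enumerate(clusters):
            if len(edges) < 2:
                continue
            weights = [w for _, _, w in edges]
            mu, sigma = statistics.mean(weights), statistics.pstdev(weights)
            if sigma == 0:
                continue
            for e in edges:
                z = (e[2] - mu) / sigma
                if z > z_max:
                    z_max, target = z, (idx, e)
        if target is None:  # largest z-score is below the threshold: halt
            return [sorted(verts) for verts, _ in clusters]
        idx, e = target
        verts, edges = clusters.pop(idx)
        rest = [x for x in edges if x != e]
        side = {e[0]}       # grow the component containing one endpoint
        grew = True
        while grew:
            grew = False
            for u, v, _ in rest:
                if (u in side) != (v in side):
                    side |= {u, v}
                    grew = True
        clusters.append((side, [x for x in rest if x[0] in side]))
        clusters.append((verts - side, [x for x in rest if x[0] not in side]))
```

For example, six patterns whose pairwise distances mimic the points 0, 1, 2, 10, 11, 12 on a line separate into two clusters of three when the threshold is 1.5, because the single long MST edge has a z-score of about 2.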

The categorization of the DNA pattern sequences into appropriate clusters is essential to train the pattern models as described in the following section. The quality of each cluster will be assessed to ensure that sufficient examples exist in the cluster to train a stochastic model. In the absence of sufficient examples, the patterns will be represented as Boolean decision trees.

59.2.3 Learning Cluster Models

A DNA sequence matrix is a set of fixed-length DNA sequence segments aligned with respect to an experimentally determined biologically significant site. The columns of a DNA sequence matrix are numbered with respect to the biological site, usually starting with a negative number. A DNA sequence motif can be defined as a matrix of depth 4 utilizing a cut-off value. The 4-column/mononucleotide matrix description of a genetic signal is based on the assumptions that the motif is of fixed length, and that each nucleotide is independently recognized by a trans-acting mechanism. For example, the following frequency matrix has been reported for the TATAA box.

Table 59.2 Weight Matrix for TATA Box

Given a set of aligned signal sequences of length $L$ corresponding to the functional signal under consideration, $F = [f_{bi}]$, $(b \in \Sigma)$, $(i = 1, \ldots, L)$ is the nucleotide frequency matrix, where $f_{bi}$ is the absolute frequency of occurrence of the $b$-th type of nucleotide out of the set $\Sigma = \{A, C, G, T\}$ at the $i$-th position along the functional site.

The frequency matrix may be utilized for developing an un-gapped score model when searching for the sites in a sequence. Typically, a log-odds scoring scheme is utilized for searching for a pattern $x$ of length $L$, as shown in Eq. (59.3). The quantity $e_i(b)$ specifies the probability of observing the base $b$ at position $i$ and is defined using a frequency matrix such as the one shown above. The quantity $q(b)$ represents the background probability for the base $b$.

\[
S = \sum_{i=1}^{L} \log \frac{e_i(x_i)}{q(x_i)}
\tag{59.3}
\]

The elements $\log \frac{e_i(x_i)}{q(x_i)}$ behave like a scoring matrix similar to the PAM and BLOSUM matrices. The term Position Specific Scoring Matrix (PSSM) is often used for pattern search with such a matrix. A PSSM can be used to search for a match in a longer sequence of length $N$ by evaluating a score $S_j$ for each starting point $j$ in the sequence from position 1 to $(N - L + 1)$, where $L$ is the length of the PSSM. These optimized weight matrices can be used to search for functional signals in nucleotide sequences. Any nucleotide fragment of length $L$ is analyzed and tested for assignment to the proper functional signal. A matching score of $\sum_{i=1}^{L} W(b_i, i)$ is assigned to the nucleotide position being examined along the sequence. In the search formulation, $b_i$ is the base at position $i$ along the biological sequence, and $W(b_i, i)$ represents the corresponding weight matrix entry for symbol $b_i$ occurring at position $i$ along the motif. A more detailed example for learning the PSSM for a pattern cluster is shown in Figure 59.4(a).
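The log-odds scoring of Eq. (59.3) and the sliding-window search can be illustrated with a short sketch. The helper names `build_pssm` and `scan`, the uniform background $q(b) = 0.25$, the pseudocount of 1 added to every cell (to avoid taking the logarithm of zero), and the toy TATA-like sites are all assumptions made for this example.

```python
import math

BASES = "ACGT"

def build_pssm(sites, background=None, pseudocount=1.0):
    """Per-position log-odds weights W(b, i) from equal-length aligned sites."""
    q = background or {b: 0.25 for b in BASES}   # assumed uniform background
    pssm = []
    for i in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}  # assumed pseudocount
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        # e_i(b) is the smoothed relative frequency of base b at position i.
        pssm.append({b: math.log((counts[b] / total) / q[b]) for b in BASES})
    return pssm

def scan(pssm, seq):
    """Score S_j for every start j (0-based) of a length-L window in seq."""
    L = len(pssm)
    return [sum(pssm[i][seq[j + i]] for i in range(L))
            for j in range(len(seq) - L + 1)]

sites = ["TATAAA", "TATAAT", "TATATA", "TATAAA"]  # illustrative sites only
scores = scan(build_pssm(sites), "GGCTATAAAGGC")
best_start = max(range(len(scores)), key=scores.__getitem__)
```

Here `scores` holds one value per window, N - L + 1 = 7 in total, and `best_start` is the 0-based offset of the highest-scoring window; in practice a cut-off on the score decides which windows are reported as matches.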

A stochastic extension of the PSSM is based on a Markovian representation of biological sequence patterns. As a first step toward learning the pattern-HMM, one of the two common HMM architectures must be selected to define the topology. These are the fully connected ergodic architecture and the Left-Right (LR) architecture. The fully connected architecture offers a higher level of modeling capability, but generally requires a larger set of training data. The left-right configuration, on the other hand, is powerful enough for modeling sequences, does not require a large training set, and facilitates model comprehension. The moderate amount of training data typically available often dictates that the LR-HMM be utilized for representing the pattern clusters.

The initial parameters for the pattern-HMM are assigned heuristically. The number of states, $N$, with the states denoted as $S = \{S_1, S_2, \ldots, S_N\}$, in a pattern-HMM may be set to as large a value as the total number of DNA symbols in the longest pattern in that cluster. A smaller number of states is heuristically chosen in practice. Each state $S_i$ is associated with an emission probability vector corresponding to the emission of each of the symbols $\{A, C, T, G\}$. The process of generating each pattern is sequential, such that the $x$-th symbol generated, $D_x$, is a result of the HMM being in a hidden state $q_x = S_i$. The parameters of the HMM are denoted as $\lambda = \{A, B, \pi\}$ and defined as follows (Rabiner, 1989).

A: The $N \times N$ matrix $A = \{a_{ij}\}$ representing the state transition probabilities.

\[
a_{ij} = \Pr[q_{x+1} = S_j \mid q_x = S_i], \quad 1 \le i, j \le N
\tag{59.4}
\]

B: The $N \times k$ state-dependent observation symbol probability matrix over the $k = 4$ bases $n \in \{A, C, T, G\}$. The elements of this matrix, $B = \{b_j(n)\}$, are defined as follows:

\[
b_j(n) = \Pr[D_x = n \mid q_x = S_j], \quad 1 \le j \le N, \; n \in \{A, C, T, G\}
\tag{59.5}
\]

π: The initial state distribution probabilities, $\pi = \{\pi_i\}$.

The Maximum Likelihood Estimation procedure suggested by Baum and Welch is next utilized for training each pattern-HMM such that the pattern sequences in a cluster would be the maximally likely set of samples generated by the underlying HMM. Figure 59.4(b) represents the training methodology applied for learning the HMM parameters based on the local alignment block used for training the PSSM in Figure 59.4(a).
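To make the parameter set $\lambda = \{A, B, \pi\}$ concrete, the sketch below builds a heuristic left-right initialization and evaluates the likelihood of a pattern with the forward algorithm, which is the quantity that Baum-Welch training increases. It does not implement Baum-Welch itself. The names `left_right_hmm` and `forward_likelihood`, the self-transition probability of 0.5, and the uniform initial emissions are assumptions for illustration.

```python
def left_right_hmm(n_states, stay=0.5):
    """Heuristic initial parameters (A, B, pi) for a left-right pattern-HMM."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i + 1 < n_states:
            A[i][i] = stay             # remain in state S_i
            A[i][i + 1] = 1.0 - stay   # advance to state S_{i+1}
        else:
            A[i][i] = 1.0              # final state only loops on itself
    # Assumed uniform emissions over the four bases before training.
    B = [{b: 0.25 for b in "ACGT"} for _ in range(n_states)]
    pi = [1.0] + [0.0] * (n_states - 1)  # always start in the first state
    return A, B, pi

def forward_likelihood(seq, A, B, pi):
    """Pr(seq | lambda) computed with the forward algorithm."""
    n = len(pi)
    alpha = [pi[j] * B[j][seq[0]] for j in range(n)]
    for sym in seq[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][sym]
                 for j in range(n)]
    return sum(alpha)
```

Before training, every sequence of length T has likelihood 0.25^T under this initialization, since the emissions carry no information yet; re-estimating A and B on the patterns of a cluster is what makes the cluster's sequences more likely than others.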

Thus, pattern HMMs may be associated with clusters where the number of instances is large enough to allow their parameters to be learned adequately. Smaller pattern clusters will be represented as PSSMs, profiles or regular expressions. Profiles are similar to PSSMs (Gribskov et al., 1990) and are generated using the sequences in a cluster when the alignment between the members of the cluster is strong. Regular expressions constitute the method of choice for smaller groups of shorter patterns where compositional statistics are hard to evaluate.

59.3 Searching for Meta-Patterns

The process of discovering hierarchical pattern associations is posed in terms of the relationships between models of a family of patterns, rather than between individual patterns. This will enable us to validate the meta-pattern hypotheses in a computationally tractable manner.

The patterns are associated when they occur within a specific distance of each other, called their association interval. The association interval will be established using a split-and-merge procedure. Using a default association interval of 1000 bp, the overall significance of patterns found by splitting this interval is assessed. Additionally, the neighboring windows are merged to assess the statistical significance of larger regions. In this manner, the region with the highest level of significance is considered as the association interval for a group of patterns.

Fig 59.4 (a) A PSSM based model induced from a Multiple Sequence Alignment (b) A HMM induced from the same alignment


A statistical significance is associated with each pattern-model pair detected within an association interval. This is achieved through two levels of searching GenBank³. The Level I search yields the regions that exhibit a high concentration of patterns. This is the first step toward generating pattern association hypotheses that are biologically significant, as patterns working in coordination are generally expected to be localized close to each other. The Level II search aims at building support and confidence, so that Level I hypotheses may be accepted or rejected based on pre-specified criteria. Consider, for example, two patterns A and B where there is a strong correlation between these two pattern HMMs in the Level I search. However, the Level II search may reveal that there is a substantially large number of instances outside the high pattern density regions where their occurrence is independent of each other. This will lead to the rejection of the A ≡ B hypothesis.

59.3.1 Level I Search: Locating High Pattern Density Region

High Pattern Density Regions (HPDRs) are the regions of the DNA sequence where the patterns modeled by the HMMs occur at a density that is higher than expected. The Level I search is aimed at identifying HPDRs, as shown in Figure 59.5. These regions may be located by measuring the significance of patterns detected in a window of size W located at a given position on the sequence. A numerical value for pattern-density at location x on the DNA sequence is obtained by treating the pattern occurrences within a window centered at location x as trials from independent Poisson processes. The null hypothesis, H0, tested in each window is essentially that the pattern frequencies observed in the window are no different from those expected in a random sequence.

A large deviation from the expected frequency of patterns in a window forces the rejection of H0. The level of confidence with which H0 is rejected is used to assign a statistical pattern-density metric to the window. Specifically, the pattern-density in a window is defined to be ρ = −log(p), where p is the probability of erroneously rejecting H0. As a matter of detail, it may be noted that the value of ρ is computed for both the forward and the reverse DNA strands, and the average of the two is taken to be the true density estimate for that location.

Fig 59.5 High Pattern Density Regions or HPDRs are detected by statistical means for all sequences in the database

³ GenBank is the database of DNA sequences that is publicly accessible from the National Institutes of Health, Bethesda, MD, USA.


In order to compute ρ, assume that we are searching for $k$ distinct types of patterns within a given window of the sequence. In general, these patterns are defined as rules $R_1, R_2, \ldots, R_k$. The probability of random occurrence of each of the $k$ patterns is calculated using the AND-OR relationships between the individual motifs. Assume that these probabilities for the $k$ patterns are $p_1, p_2, \ldots, p_k$. Next, a random vector of pattern frequencies, $F$, is constructed. $F$ is a $k$-dimensional vector with components $F = \{x_1, x_2, \ldots, x_k\}$, where each component $x_i$ is a random variable representing the frequency of the pattern $R_i$ in the $W$ base-pair window. The component random variables $x_i$ are assumed to be independently distributed Poisson processes, each with the parameter $\lambda_i = p_i \cdot W$. Thus, the joint probability of observing a frequency vector $F_{obs} = \{f_1, f_2, \ldots, f_k\}$ purely by chance is given by:

\[
P(F_{obs}) = \prod_{i=1}^{k} \frac{e^{-\lambda_i} \lambda_i^{f_i}}{f_i!}
\]

The cumulative probability α that pattern frequencies equal to or greater than those in the vector $F_{obs}$ occur purely by chance is given by Eq. (59.8) below. This corresponds to the one-sided tail of the multivariate Poisson distribution and represents the probability that H0 is erroneously rejected.

\[
\begin{aligned}
\alpha &= \Pr(x_1 \ge f_1, x_2 \ge f_2, \ldots, x_k \ge f_k) \\
&= \Pr(x_1 \ge f_1) \cdot \Pr(x_2 \ge f_2) \cdots \Pr(x_k \ge f_k) \\
&= \sum_{x_1 = f_1}^{\infty} \frac{e^{-\lambda_1} \lambda_1^{x_1}}{x_1!} \cdot \sum_{x_2 = f_2}^{\infty} \frac{e^{-\lambda_2} \lambda_2^{x_2}}{x_2!} \cdots \sum_{x_k = f_k}^{\infty} \frac{e^{-\lambda_k} \lambda_k^{x_k}}{x_k!}
\end{aligned}
\tag{59.8}
\]

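The p-value, α, in Eq. (59.8) is utilized to compute the value of ρ, or the cluster-density, as specified in Eq. (59.9) below:

\[
\begin{aligned}
\rho &= \ln \frac{1.0}{\alpha} = -\ln(\alpha) \\
&= \sum_{i=1}^{k} \lambda_i + \sum_{i=1}^{k} \ln f_i! - \sum_{i=1}^{k} f_i \ln \lambda_i - \sum_{i=1}^{k} \ln \left( 1 + \frac{\lambda_i}{f_i + 1} + \cdots + \frac{\lambda_i^{t}}{(f_i + 1)(f_i + 2) \cdots (f_i + t)} + \cdots \right)
\end{aligned}
\tag{59.9}
\]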

The infinite summation term in Eq. (59.9) quickly converges and thus can be adaptively calculated to the precision desired. For small values of $\lambda_i$, the series may be truncated such that the last term is smaller than an arbitrarily small constant, ε.
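The adaptive evaluation of Eq. (59.9) can be sketched as below. The function names `log_tail_series` and `pattern_density`, the default ε of 1e-12, and the example probabilities are assumptions; the natural logarithm is used, as in Eq. (59.9), and every $p_i$ is assumed to be strictly positive.

```python
import math

def log_tail_series(lam, f, eps=1e-12):
    """ln(1 + lam/(f+1) + lam^2/((f+1)(f+2)) + ...), truncated below eps."""
    total, term, t = 1.0, 1.0, 0
    while True:
        t += 1
        term *= lam / (f + t)   # next term of the series in Eq. (59.9)
        if term < eps:
            break
        total += term
    return math.log(total)

def pattern_density(freqs, probs, window):
    """rho = -ln(alpha) for observed frequencies f_i in a W base-pair window."""
    rho = 0.0
    for f, p in zip(freqs, probs):
        lam = p * window        # Poisson parameter lambda_i = p_i * W
        rho += (lam + math.lgamma(f + 1) - f * math.log(lam)
                - log_tail_series(lam, f))
    return rho

# One pattern with p = 0.001 seen 3 times in a 1000 bp window (lambda = 1).
rho = pattern_density([3], [0.001], 1000)
```

As a sanity check, for a single pattern with λ = 1 and f = 3 the tail probability is 1 − 2.5/e, and `rho` agrees with the negative logarithm of that value; an observed frequency of zero gives ρ = 0, since Pr(x ≥ 0) = 1.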

Fig 59.6 The analysis of human protamine gene cluster using the MAR-Finder tool Default analysis parameters were used


Figure 59.6 presents the output from the analysis of the human protamine gene sequence. This statistical inference algorithm, based on the association of patterns found in close proximity within a DNA sequence region, has been incorporated in the MAR-Finder tool. A Java-enabled version of the tool described in (Singh et al., 1997) is also available for public access from http://www.MarFinder.com

We also need to take into consideration the interdependence of pattern occurrences. Let $f_{ij}$ correspond to the observed frequency of the pattern defined by pattern-HMM $H_j$ in the $i$-th window sample. Using the frequency data from $n$ window samples, and the mean frequencies $\bar{f}_i$, the correlation matrix $R = (r_{ij})$ can be evaluated as follows:

\[
r_{ij} = \frac{s_{ij}}{s_i s_j} = \frac{\sum_{r=1}^{n} (f_{ri} - \bar{f}_i)(f_{rj} - \bar{f}_j)}{\sqrt{\sum_{r=1}^{n} (f_{ri} - \bar{f}_i)^2} \; \sqrt{\sum_{r=1}^{n} (f_{rj} - \bar{f}_j)^2}}
\tag{59.10}
\]

If the sample correlation matrix, $R$, is equal to the identity matrix, the variables can be considered to be uncorrelated or independent. The hypothesis $r_{ij} = 0$ can be tested using the statistic $t_{ij}$ defined in Eq. (59.11). $t_{ij}$ follows a Student's t distribution with $(n - 2)$ degrees of freedom (Kachigan, 1986).

\[
t_{ij} = \frac{r_{ij} \sqrt{n - 2}}{\sqrt{1 - r_{ij}^2}}
\tag{59.11}
\]
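Eqs. (59.10) and (59.11) translate directly into code. The sketch below is illustrative only: `correlation_matrix` and `t_statistic` are hypothetical names, the window samples are made-up numbers, and the code assumes that every pattern frequency varies across windows (a constant column would cause a division by zero).

```python
import math

def correlation_matrix(samples):
    """R = (r_ij) of Eq. (59.10); samples[r][j] is the count of pattern j in window r."""
    n, k = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(k)]
    dev = [[row[j] - means[j] for j in range(k)] for row in samples]

    def s(i, j):
        # Sum over windows of the products of deviations from the means.
        return sum(d[i] * d[j] for d in dev)

    return [[s(i, j) / math.sqrt(s(i, i) * s(j, j)) for j in range(k)]
            for i in range(k)]

def t_statistic(r, n):
    """Statistic of Eq. (59.11); undefined when |r| = 1."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Four hypothetical window samples for three patterns.
windows = [[1, 2, 5], [2, 4, 3], [3, 6, 4], [4, 8, 1]]
R = correlation_matrix(windows)
```

In this toy data the second pattern always occurs twice as often as the first, so `R[0][1]` is 1 and one of the two would be treated as a surrogate variable; the resulting t value is compared against the Student's t distribution with n − 2 degrees of freedom to decide whether a correlation differs significantly from zero.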

If a pattern interdependence is detected, the pairwise correlation terms in $R$ can be used to remove surrogate variables, i.e., one of the two patterns that exhibit a high degree of correlation. Removal of surrogate variables results in retaining a core subset of the original patterns that accounts for the variability of the observed data (Hair et al., 1987). Let there be $k$ such core patterns that are retained for the subsequent analysis stage. If the pairwise correlation terms of $R_k$ are non-zero, the Mahalanobis Transformation can be applied to the vector $\vec{f}_k$ to transform it to a vector $\vec{z}_k$. The property of such a transformation is that the correlation matrix of the transformed variables is guaranteed to be the identity matrix $I$ (Mardia et al., 1979). The Mahalanobis Transformation for obtaining the uncorrelated vector $\vec{z}_k$ from the observed frequency vector of core patterns $\vec{f}_k$ is specified in Eq. (59.12), with the $l_i$ denoting the eigenvalues.

\[
\vec{z}_k = S_k^{-1/2} \left( \vec{f}_k - \bar{\vec{f}}_k \right), \qquad S_k^{-1/2} = \Gamma \Lambda^{-1/2} \Gamma'
\tag{59.12}
\]

where $\vec{f}_k$ is the observed frequency vector, $\bar{\vec{f}}_k$ is its mean, $\Gamma$ is the matrix of eigenvectors of $S_k$, and $\Lambda^{-1/2} = \mathrm{diag}(l_i^{-1/2})$. The value for α can next be computed based on the transformed vector $\vec{z}_k$ as shown in Eq. (59.13). The components of the transformed vector are independent, and thus the multiplication of individual probability terms is justifiable. Each component, $z_i$, represents a linear combination of the observed frequency values.

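\[
\begin{aligned}
\alpha &= \Pr(z_1 \ge z_{f_1}, z_2 \ge z_{f_2}, \ldots, z_c \ge z_{f_c}) \\
&= \Pr(z_1 \ge z_{f_1}) \cdot \Pr(z_2 \ge z_{f_2}) \cdots \Pr(z_c \ge z_{f_c}) \\
&= \sum_{z_1 = \lceil z_{f_1} \rceil}^{\infty} \frac{e^{-1}}{z_1!} \cdot \sum_{z_2 = \lceil z_{f_2} \rceil}^{\infty} \frac{e^{-1}}{z_2!} \cdots \sum_{z_c = \lceil z_{f_c} \rceil}^{\infty} \frac{e^{-1}}{z_c!}
\end{aligned}
\tag{59.13}
\]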


59.3.2 Level II Search: Meta-Pattern Hypotheses

The meta- or higher-level pattern hypotheses are generated and tested within the HPDRs. Specifically, the Pattern Association (PA) hypotheses are generated and verified within these HPDRs. These PA hypotheses are built in a bottom-up manner from the validation of pairwise associations. For example, for two patterns A and B, a PA-hypothesis that we might validate is that A → B, with the usual semantics that the occurrence of pattern A implies the occurrence of pattern B within a pre-specified association distance. Furthermore, if the PA-hypothesis stating that B → A is also validated, the relationship between patterns A and B is promoted to that of Pattern Equivalence (PE), denoted as A ↔ B or A ≡ B. Transitivity can be used to build larger groups of associations, such that if A → B and B → C, then the implication A → BC may be concluded.

Similar statements can be made about the PE-hypotheses⁴. Meta-pattern formation using transitivity rules will lead to the discovery of mosaic-type meta-patterns. For the purpose of developing a methodology for systematically generating PA-hypotheses, the DNA sequence is represented as a sequence of 2-tuples. The first element of each tuple is the pattern match location, and the second element identifies the specific HMM(s) that matched. (It is possible for more than one pattern model to match the DNA sequence at a specific location.) Such a representation, shown in Eq. (59.14) and denoted as $F_S$, is the pattern-sequence corresponding to the biological sequence $S$.

\[
F_S = (x_1, P_a), (x_2, P_b), \ldots, (x_i, P_r), \ldots, (x_n, P_v)
\tag{59.14}
\]

Eq. (59.15) specifies the set of pattern hypotheses generated within each HPDR. The operator cadr(L) is used to denote car(cdr(L)). The equation specifies that unique hypotheses are formed considering the closest pattern $P_y$ instance to a given pattern $P_x$ instance.

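\[
\begin{aligned}
H_{AB} = \{ (A, B) \mid\; & A = \mathrm{cadr}(P_x) \wedge B = \mathrm{cadr}(P_y) \;\wedge \\
& \Delta_{A,B} = \| \mathrm{car}(P_x) - \mathrm{car}(P_y) \| \wedge (\Delta_{A,B} < \theta) \;\wedge \\
& (\neg \exists P_z)( \| \mathrm{car}(P_x) - \mathrm{car}(P_z) \| < \Delta_{A,B} ) \}
\end{aligned}
\tag{59.15}
\]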
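One way to read Eq. (59.15) is as a nearest-neighbor rule over the pattern-sequence of Eq. (59.14). The sketch below follows that reading under several assumptions: `pa_hypotheses` is a hypothetical name, each tuple holds a single pattern label, distances are absolute differences of match locations, and ties for the nearest instance all produce a hypothesis.

```python
def pa_hypotheses(pattern_seq, theta):
    """Return (A, B, distance) for each instance and its closest neighbor within theta.

    pattern_seq is a list of (location, pattern) tuples, as in Eq. (59.14).
    """
    hyps = []
    for ix, (pos_x, a) in enumerate(pattern_seq):
        others = [(abs(pos_x - pos_y), b)
                  for iy, (pos_y, b) in enumerate(pattern_seq) if iy != ix]
        if not others:
            continue
        nearest = min(dist for dist, _ in others)
        if nearest < theta:   # the condition Delta_{A,B} < theta
            # No other instance lies closer than `nearest`, matching the
            # "there is no P_z" clause of Eq. (59.15).
            hyps.extend((a, b, dist) for dist, b in others if dist == nearest)
    return hyps

# Hypothetical pattern-sequence: matches of models A, B and C in one region.
F_S = [(100, "A"), (130, "B"), (900, "C")]
hypotheses = pa_hypotheses(F_S, theta=200)
```

For this toy input the result contains the pair (A, B) and the pair (B, A), each at distance 30, while the isolated match of C generates no hypothesis because its nearest neighbor lies farther away than θ.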

An $N \times N$ matrix $C$, similar to a contingency table (Gokhale, 1978; Brien, 1989), is used for recording the significance of each PA-hypothesis generated from the analysis of all HPDRs in the entire set of sequences. Recall that these regions were identified during the Level I search. The score for cell $C_{AB}$ is updated according to Eq. (59.16) for every pattern pair (A, B) hypothesis $H_{AB}$ generated in these regions. The probabilities of random occurrence of patterns A and B are $p_A$ and $p_B$ respectively, and $\Delta_{AB}$ is the distance between them.

\[
\begin{aligned}
C_{AB} &= C_{AB} + \rho_{AB} \\
&= C_{AB} + (\lambda_A + \lambda_B - \ln \lambda_A - \ln \lambda_B), \\
&\text{where } \lambda_A = p_A \Delta_{AB} \text{ and } \lambda_B = p_B \Delta_{AB}
\end{aligned}
\tag{59.16}
\]
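The update rule of Eq. (59.16) amounts to accumulating one score per ordered pattern pair. A minimal sketch, assuming a dictionary keyed by pattern pairs in place of the $N \times N$ matrix and made-up random-occurrence probabilities (`update_contingency` is a hypothetical name):

```python
import math
from collections import defaultdict

def update_contingency(C, a, b, delta, p):
    """Add rho_AB of Eq. (59.16) to cell C[(a, b)] for one hypothesis."""
    lam_a = p[a] * delta   # lambda_A = p_A * Delta_AB
    lam_b = p[b] * delta   # lambda_B = p_B * Delta_AB
    C[(a, b)] += lam_a + lam_b - math.log(lam_a) - math.log(lam_b)
    return C

# Hypothetical probabilities of random occurrence for two patterns.
p = {"A": 0.01, "B": 0.02}
C = defaultdict(float)
update_contingency(C, "A", "B", delta=50, p=p)
```

With these numbers λ_A = 0.5 and λ_B = 1.0, so the cell for (A, B) receives 1.5 + ln 2; a hypothesis formed over a short distance between two rare patterns contributes a larger score than one formed over a long distance.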

An information-theoretic approach based on mutual information content is next utilized for characterizing the strength of association between pattern pairs. Contents of the contingency table (after all the sequences in the database have been processed) need to be converted to correspond to

⁴ The functional significance of A → BC is that protein binding to site A will lead to the binding of proteins at sites B and C. For a meta-pattern of the form A ↔ B, both the proteins must simultaneously bind to bring forth the necessary function.
