Isochore structures in the chicken genome


Feng Gao and Chun-Ting Zhang

Department of Physics, Tianjin University, China

Keywords

compositional homogeneity; compositional segmentation; Gallus gallus; isochores; windowless technique

Correspondence

C.-T. Zhang, Department of Physics, Tianjin University, Tianjin 300072, China

Fax: +86 22 27402697

Tel: +86 22 27402987

E-mail: ctzhang@tju.edu.cn

(Received 13 November 2005, revised 5 January 2006, accepted 14 February 2006)

doi:10.1111/j.1742-4658.2006.05178.x

The availability of the complete chicken genome sequence provides an unprecedented opportunity to study the global genome organization at the sequence level. Delineating compositionally homogeneous G + C domains in DNA sequences can provide much insight into the organization and biological functions of the chicken genome. A new segmentation algorithm, which is simple and fast, has been proposed to partition a given genome or DNA sequence into compositionally distinct domains. By applying the new segmentation algorithm to the draft chicken genome sequence, the mosaic organization of the chicken genome can be confirmed at the sequence level. It is shown herein that the chicken genome is also characterized by a mosaic structure of isochores, long DNA segments that are fairly homogeneous in G + C content. In total, 25 isochores longer than 2 Mb (megabases) have been identified in the chicken genome. These isochores have a fairly homogeneous G + C content and often correspond to meaningful biological units. With the aid of the cumulative GC profile technique, we proposed an intuitive picture to display the distribution of segmentation points. The relationships between G + C content and the distributions of genes, CpG islands and other genomic elements were analyzed in a readily visualized manner. The cumulative GC profile, equipped with the new segmentation algorithm, would be an appropriate starting point for analyzing the isochore structures of higher eukaryotic genomes.

Abbreviations

SNP, single nucleotide polymorphism.

The first draft genome sequence of the red jungle fowl, Gallus gallus, was published in December 2004. The chicken (G. gallus) is an important model organism that bridges the evolutionary gap between mammals and other vertebrates and serves as a main laboratory model for the 9600 extant avian species. The chicken is also the first agricultural animal to have its genome sequenced. Like most bird species, the chicken has a relatively small genome of 1200 million base pairs, or 39% of the size of the human genome [1].

The nuclear genomes of vertebrates are mosaics of isochores, very long stretches [> 300 kilobases (kb)] of DNA that are fairly homogeneous in base composition. Isochores can be partitioned into a small number of families that cover a range of GC levels, which is narrow in cold-blooded vertebrates but broad in warm-blooded vertebrates [2,3]. The large-scale variation in base composition is correlated between coding and noncoding sequences and seems to reflect a fundamental level of genome organization [4]. A number of important genomic features vary markedly with this isochore organization, including gene density [5], chromosome bands [6,7], patterns of codon usage [8], gene length [9], replication timing [10], recombination rate [11,12], and the distribution of transposable elements [13]. By in situ hybridization of fractionated DNA on mitotic and meiotic chromosomes, a compositional map of chicken chromosomes has been obtained and the most gene-rich regions have been studied [14]. Now, the availability of the complete chicken genome sequence provides an unprecedented opportunity to study the global genome organization at the sequence level.

In this article, we analyzed the isochore structures of the chicken genome using a new segmentation algorithm [15]. By applying the segmentation algorithm to 24 chicken chromosome sequences, the boundaries of isochores were obtained for each chromosome. It was found that the chicken genome is organized into a mosaic structure of isochores. In total, 25 isochores longer than 2 Mb have been identified, i.e. eight GC-rich isochores and 17 GC-poor isochores.

Results and discussion

The isochores in the chicken genome

Table 1. Summary statistics for the chicken genome. The number of isochores longer than 300 kb obtained at $t_0 = 100$ in each chromosome is also presented in the table.

Columns: Chromosome; Chromosome size (bp); Number of gaps; Percentage of gaps in the chromosome (%); G + C content (%); Number of isochores.

It should be noted that the chicken genome sequence still contains a large number of gaps (Table 1). In the case of GGA1, 9847 gaps remain. Therefore, applying the segmentation algorithm to each fragment separately would fail to unveil the characteristics of the whole genome. In order to display the global G + C content distribution along chromosomes, only gaps > 1% of the chromosome size were retained; gaps < 1% of the chromosome size were simply deleted. By applying the segmentation algorithm to the resulting contigs of each chromosome, the segmentation points were obtained at a certain threshold $t_0$. At a given threshold $t_0$, the number of resulting segmentation points reflects the compositional homogeneity of the sequence. For instance, the size of GGA6 is similar to that of GGAZ. At the same threshold $t_0 = 100$, there are 161 segmentation points in GGA6, but only 58 in GGAZ. This indicates that the GGAZ sequence is more homogeneous than GGA6, which is also confirmed by Fig. 1: the variations of the cumulative GC profile for GGA6 are much larger than those of the cumulative GC profile for GGAZ.
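The gap treatment described above (small gaps deleted, large gaps retained as contig breaks) can be sketched as follows. This is only an illustration under stated assumptions: gaps are taken to be runs of `N` in the chromosome string, coordinates are 0-based, and the function names `build_contigs`, `contig_sequence` and `to_chromosome_coordinate` are hypothetical helpers, not part of the authors' program.

```python
import re
from bisect import bisect_right


def build_contigs(chrom, min_gap_fraction=0.01):
    """Split a chromosome at large gaps and delete small gaps.

    Assumption: gaps are runs of 'N'. A gap longer than
    min_gap_fraction * len(chrom) is retained and ends the current contig;
    shorter gaps are simply skipped (deleted).
    Each contig is returned as a list of blocks
    (contig_start, chromosome_start, length), all 0-based.
    """
    threshold = min_gap_fraction * len(chrom)
    contigs, blocks = [], []
    contig_len, pos = 0, 0
    for gap in re.finditer(r"[Nn]+", chrom):
        if gap.start() > pos:  # ungapped block before this gap
            length = gap.start() - pos
            blocks.append((contig_len, pos, length))
            contig_len += length
        if gap.end() - gap.start() > threshold and blocks:
            contigs.append(blocks)  # large gap: close the contig
            blocks, contig_len = [], 0
        pos = gap.end()
    if pos < len(chrom):  # trailing ungapped block
        blocks.append((contig_len, pos, len(chrom) - pos))
    if blocks:
        contigs.append(blocks)
    return contigs


def contig_sequence(chrom, blocks):
    """Concatenate the ungapped blocks of one contig."""
    return "".join(chrom[start:start + length] for _, start, length in blocks)


def to_chromosome_coordinate(blocks, contig_pos):
    """Map a position in the gap-free contig back to the chromosome,
    i.e. add the lengths of all gaps that precede it."""
    starts = [b[0] for b in blocks]
    i = bisect_right(starts, contig_pos) - 1
    contig_start, chrom_start, _ = blocks[i]
    return chrom_start + (contig_pos - contig_start)
```

A segmentation point found in a contig can then be reported in chromosome coordinates with `to_chromosome_coordinate`, which corresponds to the coordinate correction described in the legend of Fig. 2.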

Here, $t_0$ was chosen with the aid of the cumulative GC profile and the density distribution of CpG islands. For example, 14, 20, and 148 segmentation points are obtained on GGA14 with $t_0$ set at 1000, 500, and 100, respectively. As shown in Fig. 2, the domains obtained delineate the variations of the cumulative GC profile and the density distribution of CpG islands more and more accurately with decreasing $t_0$. On the other hand, a smaller $t_0$ leads to more segmentation points and shorter segmented subsequences. Similar procedures were carried out for macrochromosomes, intermediate chromosomes and sex chromosome Z. Consequently, for macrochromosomes, intermediate chromosomes and sex chromosome Z, the threshold $t_0$ is set to 1000 to partition these chromosomes into compositionally distinct domains. For microchromosomes, which are much smaller and contain a higher density of CpG islands and genes, $t_0 = 500$ is adopted in order to reflect more details. Finally, $t_0 = 100$ is used as a threshold to identify isochores in the chicken genome. Here, the region from nucleotide 12 579 268 to 13 821 432 on GGA14 was deemed to be an isochore.

Fig. 1. The negative cumulative GC profiles for the chicken genome. The gaps in the chicken chromosome sequences are left empty in the curves. Note that sharp peaks correspond to the sites where the G + C content undergoes abrupt changes, from GC-rich regions to GC-poor regions and vice versa, indicating a mosaic structure of the chromosomes. A jump in the $-z'_n$ curve indicates an increase of the G + C content, whereas a drop in the $-z'_n$ curve indicates a decrease of the G + C content. An approximately straight region in the $-z'_n$ curve implies that the G + C content in this region is roughly constant.

The distributions of length and G + C content are presented in Fig. 3, based on all the segments obtained at $t_0 = 100$ without the minimum length constraint. It can be seen that the length distribution is notably skewed, with the highest value being 10.5 Mb, corresponding to a region with high repeat density and low gene density on GGA1. The G + C content distribution is also highly skewed, with a long tail of GC-rich regions. It should be noted that the view of the chicken genome we now have from the sequence may still be a compositionally biased one, as some of the most GC-rich, CpG-island-rich regions, namely several microchromosomes such as chromosomes 25, 29, 30, or 31, are essentially missing from the currently available chicken genome draft.

In total, 25 isochores longer than 2 Mb (excluding gaps) were identified (Table 2), i.e. eight GC-rich isochores and 17 GC-poor isochores. In general, GC-rich isochores tend to be shorter than GC-poor ones. The classification of isochores adopted here was proposed by Zhang and Zhang [16] and is based on the magnitude of the G + C content of an isochore relative to the genomic G + C content. According to this classification, the G + C content of GC-rich isochores (GC-poor isochores) is higher (lower) than the genomic G + C content.

Fig. 2. The negative cumulative GC profile for GGA14 marked with the segmentation points obtained. The bottom four plots show the distributions of the G + C content and CpG islands along chicken chromosome 14. The G + C contents are calculated for the domains segmented at $t_0 = 1000$, 500, and 100, respectively. Note that the distribution of CpG islands is closely correlated with the segmented regions of distinct G + C content. The notation used here is as follows. Besides the position coordinates, the order of occurrence of each point in the segmentation process is also labeled in the figure. We used ‘f’, ‘l’, ‘r’, and an integer to label the order of occurrence, where f denotes the first point occurring during the course of segmentation, and l and r denote that the point occurs in the left and right subsequence, respectively. The integer denotes the number of segmentations. For example, in point 12579268-rl$^2$4, the first part, 12579268, is the position coordinate. The second part, rl$^2$4, denotes the order of occurrence. The last integer, 4, in the second part means that this point occurs after four segmentations. In the symbol rl$^2$, l appears twice, so we used ‘l$^2$’ instead of ‘ll’ for convenience. Also note that the coordinate value of each segmentation point has been corrected by taking the gap length into account. For instance, suppose there is a gap at $n_0 \to n_0 + \Delta$, where $\Delta$ is the gap length. If a segmentation point obtained is situated at n, and $n > n_0$, then the actual coordinate of n adopted in this plot is $n + \Delta$. Meanwhile, the gap region $n_0 \to n_0 + \Delta$ is represented by a blank interval in this plot. Here, $n_0$ and n are the relative coordinates with respect to the contig without gaps. Other gaps are dealt with using a similar procedure.

Biological implications of isochores

With the aid of the cumulative GC profile technique, we proposed an intuitive picture to display the distribution of segmentation points. The relationships between G + C content and the distributions of genes, CpG islands and other genomic elements can be analyzed in a readily visualized manner. The cumulative GC profile is also called the $z'_n$ curve, which is a discrete function of the nucleotide position n in a genome or chromosome. Before studying the features of the cumulative GC profiles of the chicken genome, some basic characteristics of the cumulative GC profile need to be addressed. It was shown that the average G + C content of a genome or chromosome over the interval $n \to n + \Delta n$ is given by $G + C \propto \Delta(-z'_n)/\Delta n$ [16]. Therefore, a jump in the $-z'_n$ curve indicates an increase of the G + C content, whereas a drop in the $-z'_n$ curve indicates a decrease of the G + C content. An approximately straight region in the $-z'_n$ curve implies that the G + C content in this region is roughly constant. In addition, the segmentation point obtained here is exactly a turning point of the G + C content, which corresponds to an extreme point in the cumulative GC profile [15]. Therefore, the segmentation coordinates may be used to annotate the related cumulative GC profile, presenting researchers with an intuitive picture. Accordingly, the coordinates of the segmentation points for 24 chicken chromosome sequences were labeled on the cumulative GC profiles, which are accessible at http://tubic.tju.edu.cn/chicken/
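To make the link between the slope of the profile and the local G + C content concrete, here is a minimal sketch. It uses the definition given in Experimental procedures, $z_n = (A_n + T_n) - (C_n + G_n)$, and assumes that the fitted straight line has the form $z_n = kn$ (through the origin) so that $z'_n = z_n - kn$; that form, the function names, and the restriction to sequences containing only A, C, G and T are assumptions of this example. Under them, the G + C fraction over an interval is $(1 - k - \Delta z'_n/\Delta n)/2$, which is why a rise in $-z'_n$ means a rise in G + C content.

```python
import numpy as np


def cumulative_gc_profile(seq):
    """Return (z, z_prime, k) for a DNA string.

    z[n] = (A_n + T_n) - (C_n + G_n) for n = 0..N, with z[0] = 0.
    k is the least-squares slope of the line z = k * n (assumed to pass
    through the origin), and z_prime[n] = z[n] - k * n.
    """
    bases = np.frombuffer(seq.upper().encode("ascii"), dtype=np.uint8)
    is_at = (bases == ord("A")) | (bases == ord("T"))
    is_gc = (bases == ord("C")) | (bases == ord("G"))
    steps = is_at.astype(np.int64) - is_gc.astype(np.int64)
    z = np.concatenate(([0], np.cumsum(steps)))
    n = np.arange(z.size)
    k = float(n @ z) / float(n @ n)  # minimizes sum (z_n - k n)^2
    z_prime = z - k * n
    return z, z_prime, k


def gc_from_profile(z_prime, k, start, end):
    """G + C fraction of bases start+1..end, read off the profile slope."""
    slope = (z_prime[end] - z_prime[start]) / (end - start)
    return 0.5 * (1.0 - k - slope)
```

Plotting `-z_prime` against `n` gives a curve of the kind shown in Fig. 1; the segmentation coordinates can then be drawn on top of it as vertical markers.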

Analysis of the identified isochores showed that these isochores correspond to approximately straight lines in the $-z'_n$ curves, a reflection of the fact that the G + C contents in these regions are fairly homogeneous. We also found that these regions often correspond to meaningful biological units. For example, at the $t_0 = 100$ level, only three segmented domains (isochores 4, 8 and 9 in Table 2) longer than 4 Mb were identified on GGA1. These domains are located on the long arm of GGA1, corresponding to regions with high repeat density and low gene density [17]. For two of them (isochores 8 and 9 in Table 2), only approximate coordinates between 140 and 160 Mb were given in [17]. Here, the precise boundaries, sizes, and G + C contents of these isochores have been determined using the present method (Table 2).

As shown in Figs 2, 4 and 5, the segmentation points obtained have clear biological implications. Note that the distribution of CpG islands is closely correlated with the segmented regions of distinct G + C content. We therefore investigated the correlation between the G + C content of isochores and the distribution of CpG islands throughout the chicken genome (Fig. 6). With $t_0 = 100$, a total of 811 segments longer than 300 kb were considered to be isochores, according to our definition of an isochore (Table 1). There is a positive and highly significant correlation between the G + C content of these isochores and the corresponding density of CpG islands (R = 0.82, P < 0.001). The positive correlation between the G + C content and the density of CpG islands is a well-known fact. It is worth pointing out, however, that the segmentation points obtained here are exactly the boundaries of the related regions. For example, there is an abrupt increase (decrease) in the density of CpG islands at the first (second) boundary of the short GC-rich region between nucleotides 15 908 133 and 16 385 348 on GGA12 (Fig. 4). Similar phenomena are observed in other regions of distinct G + C content.

The precise boundary coordinates obtained by the segmentation algorithm and the associated cumulative GC profile provide a useful platform for analyzing a genome or chromosome. For instance, any gene-finding algorithm would benefit from these boundary coordinates. To obtain better gene-finding results, different parameters could be adopted in a gene-finding algorithm for regions of distinct G + C content delimited by precise boundary coordinates. In [1], an evidence-based system (Ensembl [18]) and two comparative gene prediction methods (TWINSCAN [19] and SGP-2 [20]) were applied to chicken gene prediction, and the overall performance of these methods was evaluated in terms of sensitivity and specificity indices. Here, the distribution of gene density is analyzed based on the prediction results of each method. We can see from Fig. 4 that the density distribution of the predicted genes is also correlated with the segmented regions of distinct G + C content. Based on the cumulative GC profile, the performance of these methods can even be assessed for a particular region in an intuitive form. As gene density is positively correlated with G + C content and CpG island density, it seems that the gene density predicted by SGP-2 is more reasonable than that predicted by Ensembl and TWINSCAN in the region between nucleotides 15 908 133 and 16 385 348 on GGA12, based on Fig. 4.

Fig. 3. Histograms of length and G + C content based on all the segments obtained at $t_0 = 100$ without the minimum length constraint in the draft genome sequence of chicken. (A) The length distribution of all the obtained segments. The length distribution is notably skewed, with the highest value being 10.5 Mb, corresponding to a region with high repeat density and low gene density on GGA1. (B) The G + C content distribution of all the obtained segments. The G + C content distribution is also highly skewed, with a long tail of GC-rich regions.

Table 2. The identified isochores longer than 2 Mb (excluding gaps) in the chicken genome at $t_0 = 100$. nt, nucleotide.

The isochore map obtained can also be displayed in the UCSC Genome Browser as a custom track, together with a series of tracks aligned with the genomic sequence [21]. As an example, the top track in Fig. 5 shows the isochore structure of chicken chromosome 28, integrated with comprehensive genome information, such as the G + C content, isochores from Pennsylvania State University (PSU) [22], gene density predicted by Ensembl, CpG islands, best alignments with the human genome, single nucleotide polymorphisms (SNPs) and repeat densities. This graphical interface allows rapid visual inspection of the correlation between different types of information [21]. Note that the density distributions of CpG islands and genes are correlated with the segmented regions of distinct G + C content. Here, the region from nucleotide 2 021 043 to 2 644 230 was deemed to be an isochore (length = 623 kb), which is the longest region among the segments obtained on GGA28. The G + C content of this isochore is 37.08%, the lowest G + C content among the identified isochores. This isochore clearly corresponds to a region that is a desert of genes, CpG islands and SNPs and that contains high-density simple tandem repeats. It can also be seen from Fig. 5 that our result is more reasonable than that obtained from PSU. The isochore data from PSU were generated based on the methods described in [22], in which a measure, the compositional heterogeneity (or variability) index, was proposed to compare the differences in compositional heterogeneity between long genomic sequences. The boundary coordinates of some isochores identified by PSU appear to be inaccurate. For example, the region from nucleotide 1 935 001 to 2 075 000 was deemed to be an isochore in the result from PSU, while both the cumulative GC profile for GGA28 (Fig. 1) and the G + C content in five-base windows clearly show an abrupt change in the G + C content within this region.
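As an illustration of how such a custom track might be prepared, the sketch below writes segments in the UCSC BED format, with `useScore=1` so that higher-scoring (here, more GC-rich) segments are drawn darker. The function name, the track name and description, and the min-max score scaling are assumptions of this example; only the first segment in the usage note below is taken from the text.

```python
def isochore_bed_track(segments, name="isochores", description="Isochore map"):
    """Format segments as a UCSC custom track in BED format.

    segments: iterable of (chrom, start, end, gc_percent), where start and
    end are 1-based inclusive nucleotide coordinates. BED uses 0-based,
    half-open intervals, hence the start - 1 below.
    The score (0..1000) is scaled linearly between the lowest and highest
    G + C content so that GC-richer segments are shaded more darkly.
    """
    segments = list(segments)
    gcs = [gc for _, _, _, gc in segments]
    lo, hi = min(gcs), max(gcs)
    lines = [f'track name="{name}" description="{description}" useScore=1']
    for chrom, start, end, gc in segments:
        score = 1000 if hi == lo else round(1000 * (gc - lo) / (hi - lo))
        lines.append(f"{chrom}\t{start - 1}\t{end}\t{gc:.2f}%\t{score}")
    return "\n".join(lines) + "\n"
```

For example, `isochore_bed_track([("chr28", 2021043, 2644230, 37.08), ("chr28", 2644231, 2700000, 45.0)])` yields a track whose first record is the GGA28 isochore discussed above; the second segment is a made-up neighbor included only to show the shading.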

Other chicken chromosomes were also analyzed with the present method; the detailed analysis is accessible at http://tubic.tju.edu.cn/chicken/. The program implementing the new segmentation algorithm is also available on request.

Fig. 4. The negative cumulative GC profile for GGA12 marked with the segmentation points obtained. The bottom five plots show the distributions of G + C content, genes and CpG islands along chicken chromosome 12. Here, the distribution of gene density is plotted based on the results predicted by SGP-2, Ensembl and TWINSCAN, respectively. Note that the density distributions of the predicted genes are also correlated with the segmented regions of distinct G + C content. However, it seems that the gene density predicted by SGP-2 is more reasonable than that predicted by Ensembl and TWINSCAN in the region between nucleotides 15 908 133 and 16 385 348. The notation used here is the same as that in Fig. 2; for details, refer to the legend of Fig. 2. Also note that there are a number of larger or smaller gaps in GGA12. Here, only gaps > 1% of the chromosome size were retained; gaps < 1% of the chromosome size were simply deleted. Consequently, GGA12 was split into two contigs. The superscript in front of the position coordinates denotes which contig the segmentation point belongs to.

Fig. 5. UCSC Genome Browser view of chicken chromosome 28 with our own custom annotation track. The top track shows the isochore map obtained, integrated with comprehensive genome information, such as the G + C content, isochores from Pennsylvania State University, gene density predicted by Ensembl, CpG islands, best alignments with the human genome, single nucleotide polymorphisms (SNPs) and repeat densities. Here, the segments longer than 50 kb obtained at $t_0 = 100$ are displayed in the UCSC Genome Browser as a custom track. These segments are represented by rectangular blocks, and the corresponding G + C contents are labeled on the left of the segments. Segments with higher G + C content are more darkly shaded. The precise boundary coordinates can be found at http://tubic.tju.edu.cn/chicken/. The region from nucleotide 2 021 043 to 2 644 230 was identified as an isochore, with the lowest G + C content (37.08%) among the segments obtained on GGA28. This isochore clearly corresponds to a region that is a desert of genes, CpG islands and SNPs and that contains high-density simple tandem repeats. Note that there are abrupt changes in the density distributions of CpG islands, genes and other elements at the boundaries of this isochore identified by the present algorithm.

Comparison with the other segmentation algorithms

Traditionally, the G + C content distribution of a genome is assessed by computing the G + C content in sliding windows moving along the genome. The disadvantage of this routinely used window-based method is that the resolution is low, e.g. the method is not sensitive in detecting small changes in the G + C content. In addition, the distribution pattern of G + C content obtained is largely dependent on the window size.
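The dependence on window size is easy to reproduce on a toy sequence. The following sketch is an illustration only (the function name and the synthetic sequence are assumptions of this example): it computes the G + C fraction in fixed-length windows and shows that a GC-rich block that is obvious with one window size is averaged away with another.

```python
def window_gc(seq, window, step=None):
    """G + C fraction in windows of a fixed length.

    step defaults to the window length (non-overlapping windows); a smaller
    step gives overlapping, sliding windows. A trailing partial window is
    ignored.
    """
    step = step or window
    seq = seq.upper()
    fractions = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        fractions.append((chunk.count("G") + chunk.count("C")) / window)
    return fractions


# A 20-base GC-rich block flanked by AT-rich sequence.
toy = "AT" * 10 + "GC" * 10 + "AT" * 10
print(window_gc(toy, 20))  # the block stands out: 0.0, 1.0, 0.0
print(window_gc(toy, 30))  # both windows average to about one third
```

With 30-base windows the block is split between two windows and neither shows it, which is the kind of resolution loss that windowless segmentation avoids.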

Historically, other windowless methods have been developed to calculate the G + C content, which are usually given the name ‘segmentation of DNA sequences’. Among them, entropic segmentation [23,24], hidden Markov models [25,26] and the wavelet shrinkage technique [27] should be mentioned. The advantages and disadvantages of the latter two methods were discussed in [28]. As the entropic segmentation algorithm is widely used to find segmentation points for various genomes, one may wonder whether the two algorithms (the entropic one and ours) give the same or different results. Here, we therefore restrict the comparison to the entropic segmentation algorithm. Both segmentation algorithms possess the highest resolution (single-nucleotide accuracy). When the new algorithm is applied to the chicken chromosome sequences, the coordinates of the segmentation points obtained are completely identical to those derived from the entropic segmentation algorithm (data not shown).

Compared with the entropic segmentation algorithm, the new algorithm has a series of merits. First, the new algorithm is simpler and faster than the entropy-based algorithm. Secondly, the new algorithm is based on the genome order index S, which has a clear geometrical meaning, i.e. it is the square of a Euclidean distance [29]. Thirdly, S possesses clear biological implications, e.g. S usually has different values in coding and noncoding regions, which has been used to recognize protein-coding genes in the budding yeast genome [30]. Finally, the new segmentation algorithm is superior to the entropic one in that it provides an intuitive picture when combined with the Z-curve representation of DNA sequences [31]. The segmentation point obtained here is exactly a turning point of the G + C content, which corresponds to an extreme point in the cumulative GC profile. Consequently, we may use the segmentation coordinates to annotate the related cumulative GC profile, presenting researchers with an intuitive picture.

Conclusions

Delineating compositionally homogeneous G + C domains in DNA sequences can provide much insight into the organization and biological functions of a given genome. Compositionally homogeneous segments of genomic DNA have been shown to correlate with a number of important genomic features. Furthermore, quantitative analysis of compositional heterogeneity reveals the statistical properties of DNA sequences, which is useful for locating the origin and terminus of replication in bacterial [32] and archaeal [33] genomes, and for detecting horizontally transferred genes and genomic islands [28].

In this paper, it has been shown that the chicken genome is organized into a mosaic structure of isochores. A new algorithm has been applied to segment 24 chicken chromosome sequences, and the boundaries of the isochores obtained for each chromosome have been determined precisely.

In summary, the cumulative GC profile marked with the coordinates of the resulting segmentation points is a useful tool for genome analysis. It leads to a neat graphical representation of G + C content variations along a genome or chromosome, and a clear-cut definition of isochores. This technique allowed us to confirm that GC-rich isochores in a chicken chromosome have higher gene and CpG island densities than AT-rich isochores. Although these are well-known characteristics of isochores in vertebrates, the advantage of the technique is that an investigator is able to study all of them in an intuitive and precise manner. We believe that a plot similar to Fig. 4 could become a common tool for analyzing the G + C content variations of any genome or chromosome. For higher eukaryotic genomes, the cumulative GC profile equipped with the new segmentation algorithm would be an appropriate starting point for analyzing their isochore structures.

Fig. 6. Correlation between the G + C content of isochores and the density distribution of CpG islands. With $t_0 = 100$, a total of 811 segments longer than 300 kb were considered to be isochores according to the definition of an isochore. The correlation coefficient and the equation of the linear regression line are given in the plot. There is a positive and highly significant correlation between the G + C content of these isochores and the corresponding density of CpG islands (R = 0.82, P < 0.001).

Experimental procedures

The draft chicken genome sequence, release galGal2, and its associated annotation files, such as the data for genes, CpG islands, SNPs, isochores from PSU, best alignments with the human genome and so on, were downloaded from http://genome.ucsc.edu/. In the present study, we follow the convention of the International Chicken Genome Sequencing Consortium (ICGSC 2004) by classifying chicken chromosomes into five macrochromosomes (GGA1-5), five intermediate chromosomes (GGA6-10) and 28 microchromosomes (GGA11-38). Here, sex chromosome W and microchromosomes smaller than GGA28 were excluded from the study. Our analysis of the distributions of G + C content, CpG islands, and genes was restricted to the remaining 24 chromosomes. The densities of CpG islands and genes were calculated in 100 kb long, nonoverlapping windows.
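A minimal sketch of the windowed density calculation is given below, together with a Pearson correlation helper of the kind that could be used to relate feature density to G + C content. The function names, the use of 0-based feature start positions, and the decision to drop a trailing partial window are assumptions of this example rather than details taken from the study.

```python
import numpy as np


def window_density(positions, chrom_length, window=100_000):
    """Count features (e.g. CpG islands or genes) per non-overlapping window.

    positions: 0-based start coordinates of the features.
    A trailing window shorter than `window` is ignored.
    """
    n_windows = chrom_length // window
    counts = np.zeros(n_windows, dtype=int)
    for p in positions:
        index = p // window
        if index < n_windows:
            counts[index] += 1
    return counts


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    return float(np.corrcoef(x, y)[0, 1])
```

Given one G + C value and one CpG island density per segment, `pearson_r(gc_values, densities)` returns the correlation coefficient R of the kind reported in Fig. 6.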

A new segmentation algorithm of DNA sequences

The genome order index S is defined by

$$S = S(P) = a^2 + c^2 + g^2 + t^2, \qquad (1)$$

where a, c, g and t denote the occurrence frequencies of A, C, G and T, respectively, in a genome or a DNA sequence. The genome order index S defined in Eqn 1 is a useful statistical quantity that reflects the compositional characteristics of a genome [29], and it can serve as an appropriate divergence measure to quantify the compositional difference between two DNA sequences [15]. The new segmentation algorithm proposed here is based on the quadratic divergence (see Eqn 2). Consider a genome with N bases. Let n be an integer, $2 \le n \le N - 1$. For a given n, the genome sequence is partitioned into two subsequences, one left and the other right. Let $w_1 = n/N$ and $w_2 = (N - n)/N$. Let $P_l = (a_l, c_l, g_l, t_l)$ and $P_r = (a_r, c_r, g_r, t_r)$, where $a_l, c_l, g_l, t_l$ and $a_r, c_r, g_r, t_r$ are the occurrence frequencies of bases A, C, G and T in the left and right subsequences, respectively. Thus,

$$\Delta S(P_l, P_r) = \frac{n}{N} S(P_l) + \frac{N - n}{N} S(P_r) - S\left\{\frac{n}{N} P_l + \frac{N - n}{N} P_r\right\}, \qquad (2)$$

where S(P) is defined by Eqn 1. If n* is the position at which $\Delta S(P_l, P_r)$ reaches its maximum, then n* is the first compositional segmentation point found in the genome. The new algorithm is also recursive, as in [23] and [24], i.e. after n* is determined, the same procedure is applied to both the resulting left and right subsequences. The procedure is applied recursively until $\Delta S(P_l, P_r)$ is less than a given threshold.
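Eqns 1 and 2 and the recursion can be sketched as follows. This is an illustrative implementation, not the authors' program: it stops when the raw maximum of $\Delta S$ falls below a user-supplied `threshold` or when a subsequence is shorter than `min_length`, whereas the study halts on a separate parameter t compared against $t_0$. The function names are hypothetical, and the cumulative-count table is kept in memory for simplicity, which would be too large for whole chromosomes.

```python
import numpy as np

BASES = "ACGT"


def best_split(seq):
    """Return (n_star, delta_S): the split 2 <= n <= N - 1 maximizing Eqn 2."""
    N = len(seq)
    arr = np.frombuffer(seq.upper().encode("ascii"), dtype=np.uint8)
    # cum[n, j] = number of occurrences of base j among the first n bases
    cum = np.zeros((N + 1, 4), dtype=np.int64)
    for j, base in enumerate(BASES):
        cum[1:, j] = np.cumsum(arr == ord(base))
    n = np.arange(2, N)                      # candidate split points
    left = cum[n]
    right = cum[N] - left
    P_l = left / n[:, None]                  # frequencies in the left part
    P_r = right / (N - n)[:, None]           # frequencies in the right part
    w1 = n / N
    w2 = 1.0 - w1
    # w1 * P_l + w2 * P_r equals the frequencies of the whole sequence,
    # so the last term of Eqn 2 is simply S of the whole sequence.
    S_whole = np.sum((cum[N] / N) ** 2)
    delta = w1 * np.sum(P_l ** 2, axis=1) + w2 * np.sum(P_r ** 2, axis=1) - S_whole
    i = int(np.argmax(delta))
    return int(n[i]), float(delta[i])


def segment(seq, threshold, min_length=3000, offset=0):
    """Recursively split seq; return sorted segmentation points.

    A point p means the boundary lies after the p-th base of the input.
    """
    if len(seq) < max(min_length, 3):
        return []
    n_star, delta = best_split(seq)
    if delta < threshold:
        return []
    left = segment(seq[:n_star], threshold, min_length, offset)
    right = segment(seq[n_star:], threshold, min_length, offset + n_star)
    return left + [offset + n_star] + right
```

For instance, `segment("AT" * 50 + "GC" * 50, threshold=0.1, min_length=10)` finds the single boundary at position 100, where $\Delta S = 0.25$.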

However, a question that needs to be answered is the halting condition of the segmentation algorithm. This is done by defining a halting parameter, t, where N is the length of the sequence or subsequence to be segmented. If $t < t_0$, the segmentation procedure halts; otherwise, the procedure continues until $t < t_0$. As we are only interested in segmenting concrete genomes, the choice of $t_0$ is based on a heuristic consideration. A larger threshold $t_0$ leads to fewer segmentation points and longer segmented subsequences, whereas a smaller threshold $t_0$ leads to more segmentation points and shorter segmented subsequences. For a segmentation point obtained, it is important to know whether the halting parameter value is significantly different from that of a random sequence. In order to halt the segmentation at different significance levels, we estimated the distribution of the halting parameter based on 100 000 random sequences with a length of 1 Mb. For each of these sequences, we calculated the halting parameter for the first point occurring during the course of segmentation, and thus obtained 100 000 numbers. The cumulative frequency and the counts were then plotted against the halting parameter (Fig. 7). For example, if the significance level is 5%, then $t_0$ corresponds to 6.194. However, a much more stringent stopping criterion is actually required in most cases. It should be noted that in some cases the segmentation procedure also halts when the resulting subsequence is shorter than a given minimum length. Here, we choose 3000 nucleotides as the minimum length, according to a requirement imposed by the experimental characterization of isochores through DNA centrifugation [3]. In general, the choice of $t_0$ and the minimum length is heuristic and must be determined on a case-by-case basis [15].

Cumulative GC profile

$z_n$ is defined as

$$z_n = (A_n + T_n) - (C_n + G_n), \quad n = 0, 1, 2, \ldots, N, \quad z_n \in [-N, N], \qquad (4)$$

where $A_n$, $C_n$, $G_n$ and $T_n$ are the cumulative numbers of the bases A, C, G and T, respectively, occurring in the subsequence from the first base to the n-th base in the DNA sequence inspected. Here, $z_n$ is one of the components of the Z-curve, which is a three-dimensional curve that uniquely represents a DNA sequence [34,35]. Usually, for an AT-rich (GC-rich) genome, $z_n$ is approximately a monotonically increasing (decreasing) linear function of n. To amplify the deviations of $z_n$, the curve of $z_n \sim n$ is fitted by a straight line using the least squares technique,

Title	Isochore structures in the chicken genome
Authors	Feng Gao, Chun-Ting Zhang
Institution	Tianjin University
Field	Physics
Type	Research article
Year of publication	2006
City	Tianjin