Feng Gao and Chun-Ting Zhang
Department of Physics, Tianjin University, China
The first draft genome sequence of the red jungle fowl, Gallus gallus, was published in December 2004. The chicken (G. gallus) is an important model organism that bridges the evolutionary gap between mammals and other vertebrates and serves as a main laboratory model for the 9600 extant avian species. The chicken also represents the first agricultural animal to have its genome sequenced. Like most bird species, the chicken has a relatively small genome of 1200 million base pairs, or 39% of the size of the human genome [1].
The nuclear genomes of vertebrates are mosaics of isochores, very long stretches [> 300 kilobases (kb)] of DNA that are fairly homogeneous in base composition. Isochores can be partitioned into a small number of families that cover a range of GC levels, which is narrow in cold-blooded vertebrates, but broad in warm-blooded vertebrates [2,3]. The large-scale variation in base composition correlates both coding and noncoding sequences and seems to reflect a fundamental level of genome organization [4]. This isochore organization shows marked variation in a number of important genomic features, including gene density [5], chromosome bands [6,7], patterns of codon usage [8], gene length [9], replication timing [10], recombination rate [11,12], and the distribution of transposable elements [13]. By in situ hybridization of fractionated DNA on mitotic and meiotic chromosomes, a compositional map of chicken chromosomes has been obtained and the most gene-rich regions have been studied [14]. Now, the availability of the complete chicken genome sequence provides an unprecedented
Keywords
compositional homogeneity; compositional
segmentation; Gallus gallus; isochores;
windowless technique
Correspondence
C.-T. Zhang, Department of Physics, Tianjin
University, Tianjin 300072, China
Fax: +86 22 27402697
Tel: +86 22 27402987
E-mail: ctzhang@tju.edu.cn
(Received 13 November 2005, revised 5
January 2006, accepted 14 February 2006)
doi:10.1111/j.1742-4658.2006.05178.x
The availability of the complete chicken genome sequence provides an unprecedented opportunity to study the global genome organization at the sequence level. Delineating compositionally homogeneous G + C domains in DNA sequences can provide much insight into the understanding of the organization and biological functions of the chicken genome. A new segmentation algorithm, which is simple and fast, has been proposed to partition a given genome or DNA sequence into compositionally distinct domains. By applying the new segmentation algorithm to the draft chicken genome sequence, the mosaic organization of the chicken genome can be confirmed at the sequence level. It is shown herein that the chicken genome is also characterized by a mosaic structure of isochores, long DNA segments that are fairly homogeneous in the G + C content. Consequently, 25 isochores longer than 2 Mb (megabases) have been identified in the chicken genome. These isochores have a fairly homogeneous G + C content and often correspond to meaningful biological units. With the aid of the technique of cumulative GC profile, we proposed an intuitive picture to display the distribution of segmentation points. The relationships between G + C content and the distributions of genes (CpG islands, and other genomic elements) were analyzed in a perceivable manner. The cumulative GC profile, equipped with the new segmentation algorithm, would be an appropriate starting point for analyzing the isochore structures of higher eukaryotic genomes.
Abbreviations
SNP, single nucleotide polymorphism.
opportunity to study the global genome organization at the sequence level.
In this article, we analyzed the isochore structures of the chicken genome using a new segmentation algorithm [15]. By applying the segmentation algorithm to 24 chicken chromosome sequences, the boundaries of isochores for each chromosome were obtained. It was found that the chicken genome is organized into a mosaic structure of isochores. Consequently, 25 isochores longer than 2 Mb have been identified, i.e. eight GC-rich isochores and 17 GC-poor isochores.
Results and discussion
The isochores in the chicken genome
It should be noted that the chicken genome sequence still contains a large number of gaps (Table 1). In the case of GGA1, there are 9847 gaps remaining. Therefore, applying the segmentation algorithm to each fragment separately would fail to reveal the characteristics of the whole genome. In order to display the global G + C content distribution along chromosomes, only gaps > 1% of the chromosome size were retained; gaps < 1% of the chromosome size were simply deleted. By applying the segmentation algorithm to the resulting contigs of each chromosome, the segmentation points were obtained at a certain threshold t0. At a given threshold t0, the number of resulting segmentation points reflects the compositional homogeneity of the sequences. For instance, the size of GGA6 is similar to that of GGAZ. At the same threshold t0 = 100, there are 161 segmentation points in GGA6, but only 58 segmentation points in GGAZ. This indicates that the GGAZ sequence is more homogeneous than GGA6, which is also confirmed by Fig. 1. The variations of the cumulative GC profile for GGA6 are
Table 1. Summary statistics for the chicken genome. The number of isochores longer than 300 kb obtained at t0 = 100 in each chromosome is also presented in the table.
Columns: Chromosome | Chromosome size (bp) | Number of gaps | Percentage of gaps in the chromosome (%) | G + C content (%) | Number of isochores
much larger than those of the cumulative GC profile for GGAZ.
Here, t0 was chosen with the aid of the cumulative GC profile and the density distribution of CpG islands. For example, there are 14, 20, and 148 segmentation points obtained on GGA14 with t0 set at 1000, 500, and 100, respectively. As shown in Fig. 2, the domains obtained delineate the variations of the cumulative GC profile and the density distribution of CpG islands more and more accurately with decreasing t0. On the other hand, a smaller t0 leads to more segmentation points and shorter segmented subsequences. Similar procedures were carried out for macrochromosomes, intermediate chromosomes and
Fig. 1. The negative cumulative GC profiles for the chicken genome. The gaps in the chicken chromosome sequences are left empty in the curves. Note that sharp peaks correspond to the sites where G + C content undergoes abrupt changes, from GC-rich regions to GC-poor regions, and vice versa, indicating a mosaic structure of the chromosomes. A jump in the −z′n curve indicates an increase of the G + C content, whereas a drop in the −z′n curve indicates a decrease of the G + C content. An approximately straight region in the −z′n curve implies that the G + C content in this region is roughly constant.
sex chromosome Z, respectively. Consequently, for macrochromosomes, intermediate chromosomes and sex chromosome Z, the threshold t0 is set to 1000 to partition these chromosomes into compositionally distinct domains. For microchromosomes, which are much smaller and contain a higher density of CpG islands and genes, t0 = 500 is adopted in order to reflect more details. Finally, t0 = 100 is used as a threshold to identify isochores in the chicken genome. Here, the region from nucleotide 12 579 268 to nucleotide 13 821 432 on GGA14 was deemed an isochore.
The distributions of length and G + C content are presented in Fig. 3, based on all the segments obtained at t0 = 100 without the constraint of the minimum length. It can be seen that the length distribution is notably skewed, with the highest value being 10.5 Mb, corresponding to a region with high-repeat density and low-gene density on GGA1. The G + C content distribution is also highly skewed, with a long tail of GC-rich regions. It should be noted that the view of the chicken genome we now have from the sequence may still be a compositionally biased one, as some of the most GC-rich, CpG-island-rich regions, namely several microchromosomes such as chromosomes 25, 29, 30, or 31, are essentially missing from the sequence in the currently available chicken genome draft.
Consequently, 25 isochores longer than 2 Mb (excluding gaps) were identified (Table 2), i.e. eight GC-rich isochores and 17 GC-poor isochores. In general, GC-rich isochores tend to be shorter than GC-poor ones. The classification of isochores adopted here was proposed by Zhang and Zhang [16] and is based on the relative magnitude of the G + C content of isochores with respect to the genomic G + C content. According to this classification, the G + C content of GC-rich isochores (GC-poor isochores) is higher (lower) than the genomic G + C content.
Biological implications of isochores
With the aid of the technique of cumulative GC profile, we proposed an intuitive picture to display the distribution of segmentation points. The relationships between G + C content and the distributions of genes (CpG islands, and other genomic elements) can be analyzed in a perceivable manner. The cumulative GC profile is also called the z′n curve, which is a discrete function of the nucleotide position n in a genome or
Fig. 2. The negative cumulative GC profile for GGA14 marked with the segmentation points obtained. The bottom four plots show the distributions of the G + C content and CpG islands along chicken chromosome 14, respectively. The G + C contents are calculated for the domains segmented at t0 = 1000, 500, and 100, respectively. Note that the distribution of CpG islands is closely correlated with the segmented regions with distinct G + C content. The notation used here is described as follows. Besides the position coordinates, the order of occurrence for each point in the segmentation process is also labeled in the figure. We used ‘f’, ‘l’, ‘r’, and an integer to label the order of occurrence, where f denotes the first point occurring during the course of segmentation, and l and r denote that the point occurs in the left and right subsequence, respectively. The integer denotes the number of segmentations. For example, in point 12579268-rl²4, the first part, 12579268, is the position coordinate. The second part, rl²4, denotes the order of occurrence. The last integer, 4, in the second part means that this point occurs after four segmentations. In the symbol rl², l appears twice, so we used ‘l²’ instead of ‘ll’ for convenience. Also note that the coordinate value of each segmentation point has been corrected by taking the gap length into account. For instance, suppose there is a gap occurring at n0 → n0 + Δ, where Δ is the gap length. If a segmentation point obtained is situated at n, and n > n0, then the actual coordinate of n adopted in this plot is n + Δ. Meanwhile, the gap region n0 → n0 + Δ is represented by a blank interval in this plot. Here, n0 and n are the relative coordinates with respect to the contig without gaps. Other gaps are dealt with using a similar procedure.
chromosome. Before studying the features of the cumulative GC profiles of the chicken genome, some basic characteristics of the cumulative GC profile need to be addressed. It was shown that the average G + C content of a genome or chromosome over the interval n → n + Δn satisfies G + C ∝ Δ(−z′n)/Δn [16]. Therefore, a jump in the −z′n curve indicates an increase of the G + C content, whereas a drop in the −z′n curve indicates a decrease of the G + C content. An approximately straight region in the −z′n curve implies that the G + C content in this region is roughly constant. In addition, the segmentation point obtained here is exactly a turning point of the G + C content, which corresponds to an extreme point in the cumulative GC profile [15]. Therefore, the segmentation coordinates may be used to annotate the related cumulative GC profile, presenting researchers with an intuitive picture. Consequently, the coordinates of segmentation points for 24 chicken chromosome sequences were labeled on the cumulative GC profiles, which are accessible at http://tubic.tju.edu.cn/chicken/
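The relation between the slope of the profile and the local G + C content can be illustrated with a minimal Python sketch. It relies on the definition zn = (An + Tn) − (Cn + Gn) given under Experimental procedures, and it assumes that z′n = zn − kn, where k is the slope of an ordinary least-squares line fitted to the zn curve; the exact form of the fit is an assumption here. The function names are hypothetical and are not taken from the authors' program.

```python
def z_curve(seq):
    """Cumulative z_n = (A_n + T_n) - (C_n + G_n) for n = 0..N (z_0 = 0).
    Assumes seq contains only the letters A, C, G and T."""
    z = [0]
    for base in seq:
        z.append(z[-1] + (1 if base in "AT" else -1))
    return z


def least_squares_slope(z):
    """Slope k of the ordinary least-squares line fitted to the points (n, z_n)."""
    count = len(z)
    n_mean = (count - 1) / 2
    z_mean = sum(z) / count
    sxy = sum((n - n_mean) * (zn - z_mean) for n, zn in enumerate(z))
    sxx = sum((n - n_mean) ** 2 for n in range(count))
    return sxy / sxx


def gc_fraction(z, start, end):
    """G + C fraction of bases start+1..end, from the slope of z_n.
    Over the window, delta z = (#AT) - (#GC) and delta n = (#AT) + (#GC)."""
    return 0.5 * (1 - (z[end] - z[start]) / (end - start))


# Toy sequence: an AT-only block followed by a GC-only block.
z = z_curve("AT" * 10 + "GC" * 10)
k = least_squares_slope(z)
z_prime = [zn - k * n for n, zn in enumerate(z)]  # assumed detrended profile
```

Because zn = z′n + kn, the same window value can be written as (1 − k − Δz′n/Δn)/2, so the G + C content rises exactly where −z′n rises, as stated above.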
Analysis of the identified isochores showed that these isochores correspond to an approximately straight line in the −z′n curves, a reflection of the fact that the G + C contents in these regions are fairly homogeneous. We also found that these regions often correspond to meaningful biological units. For example, at the t0 = 100 level, only three segmented domains (isochores 4, 8 and 9 in Table 2) longer than 4 Mb were identified on GGA1. These domains are located on the long arm of GGA1, corresponding to regions with high-repeat density and low-gene density [17]. For two of them (isochores 8 and 9 in Table 2), only approximate coordinates between 140 and 160 Mb were given in [17]. Here, the precise boundaries, sizes, and G + C contents of these isochores have been determined using the present method (Table 2).
As shown in Figs 2, 4 and 5, the obtained segmentation points have clear biological implications. Note that the distribution of CpG islands is closely correlated with the segmented regions with distinct G + C content. We therefore investigated the correlation between the G + C content of isochores and the distribution of CpG islands throughout the chicken genome (Fig. 6). With t0 = 100, a total of only 811 segments longer than 300 kb were considered as isochores, according to our definition of an isochore (Table 1). It was shown that there are positive and highly significant correlations between the G + C content of these isochores and the corresponding density distribution of CpG islands (R = 0.82, P < 0.001). The positive correlation between the G + C content and the density distribution of CpG islands is a well-known fact. It is therefore worth pointing out that the segmentation points obtained here are exactly the boundaries of the related regions. For example, there is an abrupt increase (decrease) of the density of CpG islands at the first (second) boundary of the short GC-rich region between nucleotides 15 908 133 and 16 385 348 on GGA12 (Fig. 4). Similar phenomena are observed in other G + C distinct regions.
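For readers who wish to reproduce this kind of correlation on their own segments, the coefficient R can be computed as in the following minimal Python sketch. It assumes that R denotes the ordinary Pearson correlation coefficient between per-isochore G + C content and CpG-island density; the function name and the toy numbers are hypothetical and are not data from this study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy values, for illustration only (not results from the paper):
gc_percent = [38.0, 40.0, 42.0, 44.0]      # G + C content per segment (%)
cpg_per_100kb = [1.0, 2.0, 3.0, 4.0]       # CpG islands per 100 kb
r_toy = pearson_r(gc_percent, cpg_per_100kb)
```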
The precise boundary coordinates obtained by the segmentation algorithm and the associated cumulative
Fig. 3. Histogram of length and G + C content based on all the segments obtained at t0 = 100 without the constraint of the minimum length in the draft genome sequence of chicken. (A) The length distribution of all the obtained segments. The length distribution is notably skewed, with the highest value being 10.5 Mb, corresponding to a region with high-repeat density and low-gene density on GGA1. (B) The G + C content distribution of all the obtained segments. It shows that the G + C content distribution is also highly skewed, with a long tail of GC-rich regions.
GC profile provide a useful platform to analyze a genome or chromosome. For instance, any gene-finding algorithm would benefit from these boundary coordinates. To obtain better gene-finding results, different parameters could be adopted in a gene-finding algorithm for regions of distinct G + C content, using the precise boundary coordinates. In [1], an evidence-based system (Ensembl [18]) and two comparative gene prediction methods (TWINSCAN [19] and SGP-2 [20]) were applied to chicken gene prediction, and the overall performances of these methods were evaluated in terms of sensitivity and specificity indices. Here, the distribution of gene density is analyzed separately for the prediction results of each method. We can see from Fig. 4 that the density distribution of the predicted genes is also correlated with the segmented regions with distinct G + C content. Based on the cumulative GC profile, the performance of these methods can even be assessed for a certain region in an intuitive form. As gene density is positively correlated with G + C content and CpG island density, it seems that the gene density predicted by SGP-2 is more reasonable than that predicted by Ensembl and TWINSCAN in the region between nucleotides 15 908 133 and 16 385 348 on GGA12, based on Fig. 4.
The obtained isochore map can also be displayed in the UCSC Genome Browser as a custom track, together with a series of tracks aligned with the genomic sequence [21]. As an example, the top track in Fig. 5 shows the isochore structure of chicken chromosome 28, integrated with comprehensive genome information, such as the G + C content, isochores from Pennsylvania State University (PSU) [22], gene density predicted by Ensembl, CpG islands, best alignments with the human genome, single nucleotide polymorphisms (SNPs) and repeat densities. This graphical interface allows rapid visual inspection of the correlation of different types of information [21]. Note that the density distributions of CpG islands and genes are correlated with the segmented regions with distinct G + C content. Here, the region from nucleotide 2 021 043 to nucleotide 2 644 230 was deemed an isochore (with length = 623 kb), which is the longest region among the obtained segments on GGA28. The G + C content of this isochore is 37.08%, the lowest G + C content among the identified isochores. It is clearly shown that this isochore corresponds to a desert region of genes/CpG islands/SNPs and contains high-density simple tandem repeats. It can also be seen from Fig. 5 that our result is more reasonable than that obtained from PSU. The isochore data from PSU
Table 2. The identified isochores longer than 2 Mb (excluding gaps) in the chicken genome at t0 = 100. nt, nucleotide.
were generated based on the methods described in [22], in which a measure, the compositional heterogeneity (or variability) index, was proposed to compare the differences in compositional heterogeneity between long genomic sequences. The boundary coordinates of some of the isochores identified by PSU appear to be inaccurate. For example, the region from nucleotide 1 935 001 to nucleotide 2 075 000 was deemed an isochore in the result from PSU, whereas both the cumulative GC profile for GGA28 (Fig. 1) and the G + C content in five-base windows clearly show an abrupt change in the G + C content within this region.
Based on the present method, other chicken chromosomes were also analyzed; the detailed analysis is accessible at http://tubic.tju.edu.cn/chicken/. The program implementing the new segmentation algorithm is also available on request.
Comparison with the other segmentation algorithms
Traditionally, the G + C content distribution of a genome is assessed by computing the G + C content in sliding windows moving along the genome.
Fig. 4. The negative cumulative GC profile for GGA12 marked with the segmentation points obtained. The bottom five plots show the distributions of G + C content, genes and CpG islands along chicken chromosome 12, respectively. Here, the distribution of gene density is plotted based on the predicted results by SGP-2, Ensembl and TWINSCAN, respectively. Note that the density distributions of the predicted genes are also correlated with the segmented regions with distinct G + C content. However, it seems that the gene density predicted by SGP-2 is more reasonable than that predicted by Ensembl and TWINSCAN in the region between nucleotides 15 908 133 and 16 385 348. The notation used here is the same as that in Fig. 2. For details about the notation, refer to the legend of Fig. 2. Also note that there are a number of larger or smaller gaps in GGA12. Here, only gaps > 1% of the chromosome size were retained; gaps < 1% of the chromosome size were simply deleted. Consequently, GGA12 was split into two contigs. The superscript in front of the position coordinates is used to denote which contig the segmentation point belongs to.
Fig. 5. UCSC Genome Browser on chicken chromosome 28 with our own custom annotation track. The top track shows the obtained isochore map integrated with comprehensive genome information, such as the G + C content, isochores from Pennsylvania State University, gene density predicted by Ensembl, CpG islands, best alignments with the human genome, single nucleotide polymorphisms (SNPs) and repeat densities. Here, the obtained segments longer than 50 kb at t0 = 100 are displayed at the UCSC Genome Browser as a custom track. These segments are represented by rectangular blocks, and the corresponding G + C contents are labeled on the left of the segments. Segments with higher G + C content are more darkly shaded. The precise boundary coordinates can be found at http://tubic.tju.edu.cn/chicken/. The region from nucleotide 2 021 043 to nucleotide 2 644 230 was identified as an isochore, with the lowest G + C content (37.08%) among the obtained segments on GGA28. It is clearly shown that this isochore corresponds to a desert region of genes/CpG islands/SNPs and contains high-density simple tandem repeats. Note that there are abrupt changes in the density distributions of CpG islands, genes and other elements at the boundaries of this isochore identified by the present algorithm.
The disadvantage of this routinely used window-based method is that the resolution is low, e.g. the method is not sensitive in detecting small changes in the G + C content. In addition, the distribution pattern of G + C content obtained is largely dependent on the window size.
Historically, other windowless methods have been developed to calculate the G + C content, which are usually given the name of ‘segmentation of DNA sequences’. Among them, the methods of entropic segmentation [23,24], hidden Markov model [25,26] and wavelet shrinkage technique [27] should be mentioned. The advantages and disadvantages of the latter two methods were discussed in [28]. As the entropic segmentation algorithm is widely used to find segmentation points for various genomes, one may wonder whether the two algorithms (the entropic algorithm and ours) give the same or different results. Therefore, it is interesting to compare the two segmentation algorithms. Here, we restrict the comparison to the entropic segmentation algorithm. Both segmentation algorithms possess the highest resolution (single nucleotide accuracy). By applying the new algorithm to the chicken chromosome sequences, the coordinates of segmentation points obtained are completely identical to those derived from the entropic segmentation algorithm (data not shown here).
Compared with the entropic segmentation algorithm, the new algorithm has a series of merits. First, the new algorithm is simpler and faster than the entropy-based algorithm. Secondly, the new algorithm is based on the genome order index S, which has a clear geometrical meaning, i.e. it is a square of a Euclidean distance [29]. Thirdly, S possesses clear biological implications, e.g. S usually has different values in coding and noncoding regions, which has been used to recognize protein-coding genes in the budding yeast genome [30]. Finally, the new segmentation algorithm is superior to the entropic one in that the former is able to provide an intuitive picture by incorporating the Z-curve representation of DNA sequences [31]. The segmentation point obtained here is exactly a turning point of the G + C content, which corresponds to an extreme point in the cumulative GC profile. Consequently, we may use the segmentation coordinates to annotate the related cumulative GC profile, presenting researchers with an intuitive picture.
Conclusions
Delineating compositionally homogeneous G + C domains in DNA sequences can provide much insight into the understanding of the organization and biological functions of a given genome. Compositionally homogeneous segments of genomic DNA have been shown to correlate with a number of important genomic features. Furthermore, quantitative analysis of compositional heterogeneity reveals the statistical properties of DNA sequences, which is useful to locate the origin and terminus of replication in bacterial [32] and archaeal [33] genomes, and to detect horizontally transferred genes and genomic islands [28].
In this paper, it has been shown that the chicken genome is organized into a mosaic structure of isochores. A new algorithm has been applied to segment 24 chicken chromosome sequences, and the boundaries of isochores obtained for each chromosome have been determined precisely.
In summary, the cumulative GC profile marked with the coordinates of resulting segmentation points is a useful tool for genome analysis. This leads to a neat graphical representation of G + C content variations along a genome or chromosome, and a clear-cut definition of isochores. This technique allowed us to show/confirm that GC-rich isochores in a chicken chromosome have higher gene and CpG-island densities than AT-rich isochores. Although these are well-known characteristics of isochores of the vertebrate organisms, the advantage of the technique is that an investigator is able to study all of these in a perceivable and precise manner. We believe that a plot similar
to Fig. 4 could become a common tool for analyzing
Fig. 6. Correlation between the G + C content of isochores and the density distribution of CpG islands. With t0 = 100, a total of only 811 segments longer than 300 kb were considered as isochores according to the definition of isochore. The correlation coefficient and the equation of the linear regression line are given in the plot. It shows there are positive and highly significant correlations between the G + C content of these isochores and the corresponding density distribution of CpG islands (R = 0.82, P < 0.001).
the G + C content variations for any genome or chromosome. For higher eukaryotic genomes, the cumulative GC profile equipped with the new segmentation algorithm would be an appropriate starting point for analyzing their isochore structures.
Experimental procedures
The draft chicken genome sequence, release galGal2, and its associated annotation files, such as the data of genes, CpG islands, SNPs, isochores from PSU, best alignments with the human genome and so on, were downloaded from http://genome.ucsc.edu/. In the present study, we follow the convention of the International Chicken Genome Sequencing Consortium (ICGSC 2004) by classifying chicken chromosomes into five macrochromosomes (GGA1-5), five intermediate chromosomes (GGA6-10) and 28 microchromosomes (GGA11-38). Here, sex chromosome W and microchromosomes smaller than GGA28 were excluded from the study. Our analysis of the distributions of G + C content, CpG islands, and genes was restricted to the remaining 24 chromosomes. The densities of CpG islands and genes were calculated in 100 kb long, nonoverlapping windows.
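The windowed densities can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each feature (a CpG island or a gene) is assigned to a window by its 0-based start coordinate, and the last window may be shorter than 100 kb. How features that span a window boundary were counted is not stated here, and the function name is hypothetical.

```python
def window_counts(starts, chrom_length, window=100_000):
    """Count features per nonoverlapping window of the given size.
    starts: 0-based feature start coordinates, each < chrom_length (assumed)."""
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in starts:
        counts[pos // window] += 1
    return counts


# Toy example: four features on a 300 kb sequence.
toy_counts = window_counts([0, 99_999, 100_000, 250_000], 300_000)
```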
A new segmentation algorithm of DNA sequences
The genome order index S is defined by

$$S = S(P) = a^{2} + c^{2} + g^{2} + t^{2}, \tag{1}$$

where a, c, g and t denote the occurrence frequencies of A, C, G and T, respectively, in a genome or a DNA sequence. The genome order index S defined in Eqn 1 is a useful statistical quantity to reflect the compositional characteristics of a genome [29], which can serve as an appropriate divergence measure to quantify the compositional difference between two DNA sequences [15]. The new segmentation algorithm proposed here is based on the quadratic divergence (see Eqn 2). Consider a genome with N bases. Let n be an integer, 2 ≤ n ≤ N − 1. For a given n, the genome sequence is partitioned into two subsequences, one left and the other right. Let w1 = n/N and w2 = (N − n)/N. Let Pl = (al, cl, gl, tl) and Pr = (ar, cr, gr, tr), where al, cl, gl, tl and ar, cr, gr, tr are the occurrence frequencies of bases A, C, G and T in the left and right subsequences, respectively. Thus,

$$\Delta S(P_l, P_r) = \frac{n}{N} S(P_l) + \frac{N - n}{N} S(P_r) - S\!\left\{ \frac{n}{N} P_l + \frac{N - n}{N} P_r \right\}, \tag{2}$$

where S(P) is defined by Eqn 1. If we suppose that n* is a position at which ΔS(Pl, Pr) reaches its maximum, then n* is the first compositional segmentation point found in the genome. The new algorithm is also recursive, as in [23] and [24], i.e. after n* is determined, the same procedure is applied to both the resulting left and right subsequences. The procedure should be applied recursively until ΔS(Pl, Pr) is less than a given threshold.
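The recursion described above can be illustrated with a minimal Python sketch. It is not the authors' program: it assumes a sequence made up only of the letters A, C, G and T, it stops on a raw ΔS threshold rather than on the halting parameter t, and the names `order_index`, `best_split` and `segment` as well as the `min_length` handling are hypothetical choices made for illustration.

```python
from collections import Counter


def order_index(counts, length):
    """Genome order index S = a^2 + c^2 + g^2 + t^2 (Eqn 1)."""
    return sum((counts[b] / length) ** 2 for b in "ACGT")


def best_split(seq):
    """Return (n_star, delta_s): the n in 2..N-1 that maximizes Eqn 2."""
    N = len(seq)
    total = Counter(seq)
    # (n/N) P_l + ((N-n)/N) P_r equals the whole-sequence frequencies,
    # so the subtracted term in Eqn 2 is simply S of the whole sequence.
    s_whole = order_index(total, N)
    left = Counter(seq[:1])
    best_n, best_ds = None, -1.0
    for n in range(2, N):
        left[seq[n - 1]] += 1  # left subsequence is now seq[:n]
        right = {b: total[b] - left[b] for b in "ACGT"}
        ds = ((n / N) * order_index(left, n)
              + ((N - n) / N) * order_index(right, N - n)
              - s_whole)
        if ds > best_ds:
            best_n, best_ds = n, ds
    return best_n, best_ds


def segment(seq, threshold, min_length, offset=0):
    """Recursively collect segmentation points (number of bases to the left)."""
    if len(seq) < 2 * min_length:
        return []
    n_star, ds = best_split(seq)
    if n_star is None or ds < threshold:
        return []
    if n_star < min_length or len(seq) - n_star < min_length:
        return []
    return (segment(seq[:n_star], threshold, min_length, offset)
            + [offset + n_star]
            + segment(seq[n_star:], threshold, min_length, offset + n_star))


# Toy sequence: 100 bases of AT repeats followed by 100 bases of GC repeats.
toy = "AT" * 50 + "GC" * 50
points = segment(toy, threshold=0.1, min_length=10)
```

On this toy input the only split that exceeds the threshold lies at the junction between the AT-only and GC-only blocks. Updating the left-hand counts incrementally keeps each scan linear in the sequence length, which is one plausible reason the method is fast.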
However, a question which needs to be answered is the halting condition of the segmentation algorithm. This is done by defining a halting parameter, t,
where N is the length of the sequence or subsequence to be segmented. If t < t0, the segmentation procedure halts; otherwise, the procedure continues until t < t0. As we are only interested in segmenting concrete genomes, the choice of t0 is based on a heuristic consideration. A larger threshold t0 leads to fewer segmentation points and longer segmented subsequences, whereas a smaller threshold t0 leads to more segmentation points and shorter segmented subsequences. For an obtained segmentation point, it is important to know whether the halting parameter value is significantly different from that of a random sequence. In order to halt the segmentation at different significance levels, we estimated the distribution of the halting parameter based on 100 000 random sequences with a length of 1 Mb. For each of these sequences, we calculated a halting parameter for the first point occurring during the course of segmentation and thus obtained 100 000 numbers. Cumulative frequency and counts were then plotted against the halting parameter, respectively (Fig. 7). For example, if the significance level is 5%, then t0 corresponds to 6.194. However, a much more stringent stopping criterion is actually required in most cases. It should be noted that in some cases the segmentation procedure also halts when the resulting subsequence is shorter than a given minimum length. Here, we choose 3000 nucleotides as the minimum length according to a requirement imposed by the experimental characterization of isochores through DNA centrifugation [3]. In general, the choice of t0 and the minimum length is heuristic and must be determined on a case by case basis [15].
Cumulative GC profile
zn is defined as

$$z_n = (A_n + T_n) - (C_n + G_n), \quad n = 0, 1, 2, \ldots, N; \quad z_n \in [-N, N], \tag{4}$$

where An, Cn, Gn, and Tn are the cumulative numbers of the bases A, C, G and T, respectively, occurring in the subsequence from the first base to the n-th base in the DNA sequence inspected. Here, zn is one of the components of the Z-curve, which is a three-dimensional curve that uniquely represents a DNA sequence [34,35]. Usually, for an AT-rich (GC-rich) genome, zn is approximately a monotonically increasing (decreasing) linear function of n. To amplify the deviations of zn, the curve of zn ~ n is fitted by a straight line using the least squares technique,