Comparative Genome Analysis Focused on Periodicity from Prokaryote to Higher Eukaryote Genomes Based on Power Spectrum

Atsushi FUKUSHIMA, Toshimichi IKEMURA and Shigehiko KANAYA



1 Introduction

To understand natural architectures, it is important to investigate their latent periodicities. Analysis of periodicities in genomic DNA is likewise important for clarifying basic genomic architecture and complements experimental work. In particular, large portions of eukaryotic genomes are composed of repetitive DNA sequences such as satellite DNAs, minisatellites, microsatellites, and transposable elements. A survey of the various periodicities in these sequences may clarify the structure and function of genomic DNA from a unique perspective. In recent years, statistical properties of DNA sequences (e.g., periodicity) have been examined by many methods, including autocorrelation function analysis, Fourier spectrum analysis, DNA walks, entropy, Hurst index estimation, detrended fluctuation analysis, wavelet transforms, mutual information functions, and computational linguistics (reviewed in [1] and [2]).
In the present paper, we describe periodic structures of prokaryotic and eukaryotic genomes, including Caenorhabditis elegans, Arabidopsis thaliana, Drosophila melanogaster, Anopheles gambiae, and Homo sapiens, using a power spectrum method, and characterize their periodicities at the nucleotide sequence level based on a parameter (Fk(N1,N2); described in Section 2.2). Tandem repeats, another class of repetitive sequences identified by these methods, are also present in a wide range of eukaryotic genomes. We tried to distinguish tandem repeats from interspersed repeats. This method discloses characteristic elements that were not detected by homology searches because of their relatively low levels of sequence homology.
This paper discusses in detail the periodicities of the analyzed genomes. Section 2 introduces the power spectrum method and the periodic nucleotide distribution parameter Fk(N1,N2). Section 3.1 shows the short periodicities in prokaryotic and eukaryotic genomes that have been sequenced completely. Sections 3.2 to 3.5 show that power spectrum analysis reveals novel periodicities in the C. elegans, A. thaliana, D. melanogaster, and H. sapiens genomes, respectively. Section 3.6 discusses the long-range periodicity in eukaryotic genomes, including A. gambiae, which is associated with a relation between fractal property and gene organization of genomes.

2 Methods

2.1 Power spectrum analysis

The power spectrum is a transformation of the data into "frequency space". Its advantage is that periodic patterns in the original data, whether "hidden" or "latent", become clear after transformation: hidden periodic signals appear as peaks in the spectrum. To apply the method to a genome, we transformed each DNA sequence into a binary sequence xj: when base 'b' is present at position j, xj = 1; otherwise, xj = 0 (j = 0, 1, 2, ..., N-1) [3 - 5]. Here, 'b' is one of the four nucleotides (b = A, T, G, or C). Consequently, we obtain four binary sequences from a genomic DNA sequence. The power spectrum of the binary sequence xj of length N is defined as Eq. (1).

where i^2 = -1 and fj = j/N (j = 0, ..., N-1). The averaged power spectrum was plotted as Eq. (2).

We used a fast Fourier transform algorithm, which accelerates calculation of the power spectrum but requires that the length of the analyzed DNA sequence be a power of 2, that is, 2^n nucleotides.
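As a minimal sketch of the procedure in this section, the Python code below builds the binary indicator sequence for one base and computes its power spectrum with a small radix-2 FFT. The function names, the pure-Python FFT, and the 1/N normalization of the spectrum are assumptions made for illustration; they are not taken from the original implementation.

```python
import cmath


def indicator(seq, base):
    """Binary sequence x_j: 1.0 where `base` occurs at position j, else 0.0."""
    return [1.0 if c == base else 0.0 for c in seq.upper()]


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of 2 (checked by the caller)."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def power_spectrum(seq, base):
    """S(f_j) at f_j = j/N for one base; the 1/N normalization is an assumption."""
    x = indicator(seq, base)
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("sequence length must be a power of 2")
    return [abs(c) ** 2 / n for c in fft(x)]


# Toy input (not genomic data): 'A' recurs with a 4-bp period, N = 2**5.
seq = "AACC" * 8
spectrum = power_spectrum(seq, "A")
# Strongest non-zero frequency index j; the period is N / j.
peak = max(range(1, len(seq) // 2 + 1), key=lambda j: spectrum[j])
print("strongest period (bp):", len(seq) / peak)
```

For this toy input the peak falls at j = 8, i.e. at f = 1/4. Real chromosomes would need an iterative or library FFT, since a recursive pure-Python version is far too slow for windows of millions of bases.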

2.2 Periodic nucleotide distribution

To assign regions that contribute to each periodicity, we calculated periodic nucleotide distributions [6]. This parameter measures the relative frequency of a nucleotide pair at a distance of k bp. Here, fk(N1,N2) denotes the frequency of the nucleotide pair N1 and N2 separated by k bp in a window of length L, and f(Ns) denotes the frequency of the single nucleotide Ns (s = 1, 2) in the same window. The periodic nucleotide distribution parameter Fk(N1,N2) is calculated by Eq. (3).

$$F_k(N_1,N_2) = \frac{f_k(N_1,N_2)}{f(N_1)\, f(N_2)} \qquad (3)$$

When occurrences of the nucleotides N1 and N2 at distance k are statistically independent, fk(N1,N2) = f(N1) . f(N2). If Fk(N1,N2) is significantly higher than 1, the nucleotide pair is highly abundant in some regions of the genome. Bias in the nucleotide composition of the genome is canceled out in Fk(N1,N2). In the present analysis, N1 and N2 were set as identical nucleotides (A, T, G, and C); that is, we examined Fk(N1,N1) for individual DNA sequences.
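The ratio described above can be sketched in a few lines of Python. The function name is hypothetical, and the choice of denominators (L - k possible pair positions for the pair frequency, L for the single-nucleotide frequencies) is an assumption; the text does not specify them.

```python
def periodic_nucleotide_distribution(window, n1, n2, k):
    """F_k(N1,N2): observed pair frequency at distance k over its expectation
    under independence, f(N1) * f(N2), within a single window."""
    window = window.upper()
    length = len(window)
    pairs = length - k  # number of positions i for which i + k is inside the window
    if pairs <= 0:
        raise ValueError("window must be longer than k")
    f1 = window.count(n1) / length
    f2 = window.count(n2) / length
    if f1 == 0 or f2 == 0:
        return 0.0  # the pair cannot occur at all
    hits = sum(1 for i in range(pairs)
               if window[i] == n1 and window[i + k] == n2)
    return (hits / pairs) / (f1 * f2)


# Toy window (not genomic data) with a perfect 4-bp repeat.
toy = "ACGT" * 25
print(periodic_nucleotide_distribution(toy, "A", "A", 4))  # in phase with the repeat
print(periodic_nucleotide_distribution(toy, "A", "A", 2))  # out of phase
```

In the toy window every A is followed by another A four positions later, so the ratio equals 1/f(A) = 4; at k = 2 no A-A pair occurs and the ratio is 0. A sequence without positional correlation would give values near 1.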
The parameter Fk(N1,N2) in Eq. (3) can be extended from single nucleotides to nucleotide sequences. For the uth nucleotide sequence NSu (u = 1, 2), the numbers of the four nucleotides (A, T, G, and C) in NSu are denoted by bA(NSu), bT(NSu), bG(NSu), and bC(NSu); fk(NS1,NS2) represents the frequency of the pair of nucleotide sequences NS1 and NS2 separated by k bp in window L; and f(NSu) represents the frequency of NSu expected from the single-nucleotide frequencies in window L. The periodic nucleotide distribution Fk(NS1,NS2) is therefore calculated by Eq. (4).

$$F_k(NS_1,NS_2) = \frac{f_k(NS_1,NS_2)}{f(NS_1)\, f(NS_2)} \qquad (4)$$

where $f(NS_u) = f(A)^{b_A(NS_u)} \cdot f(T)^{b_T(NS_u)} \cdot f(G)^{b_G(NS_u)} \cdot f(C)^{b_C(NS_u)}$.
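The extension to nucleotide sequences can be sketched the same way. Here the expected frequency of a word is the product of its single-nucleotide frequencies, as in the definition of f(NSu) above. The function names and the counting of valid start positions are assumptions for illustration.

```python
def expected_word_frequency(window, word):
    """f(NS): product of single-nucleotide frequencies, one factor per base of the word."""
    freq = {b: window.count(b) / len(window) for b in "ATGC"}
    p = 1.0
    for b in word:
        p *= freq[b]
    return p


def word_pair_distribution(window, ns1, ns2, k):
    """F_k(NS1,NS2): observed frequency of NS1 at i and NS2 at i + k,
    divided by f(NS1) * f(NS2)."""
    window = window.upper()
    # Start positions where both words fit inside the window (an assumption).
    positions = len(window) - max(len(ns1), k + len(ns2)) + 1
    if positions <= 0:
        raise ValueError("window too short for this k and word length")
    expected = expected_word_frequency(window, ns1) * expected_word_frequency(window, ns2)
    if expected == 0:
        return 0.0
    hits = sum(1 for i in range(positions)
               if window.startswith(ns1, i) and window.startswith(ns2, i + k))
    return (hits / positions) / expected


# Toy window (not genomic data): TGG recurs every 5 bp.
toy = "TGGAC" * 20
print(word_pair_distribution(toy, "TGG", "TGG", 5))
```

Because the expected frequency of a trinucleotide pair is a product of six single-base frequencies, the ratio for a strongly periodic word is far above 1, which is in line with the large values listed for trinucleotides in Table 2.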

3 Results and discussion

3.1 Overview of short periodicities in prokaryotic and eukaryotic genomes

For the short periodicity region, power spectra for genomes that have been sequenced completely are shown in Figure 1. All prokaryote genomes have a 3-bp periodicity (corresponding to frequency f = 1/3), and most also have a 10-11 bp periodicity (f ≈ 1/10-1/11). The 3-bp periodicity was also observed in S. cerevisiae, C. elegans, A. gambiae, D. melanogaster, and H. sapiens. The 3-bp periodicity is associated with codon usage [7 - 9], and the 10-11 bp periodicity is associated with the DNA helical repeat structure (10.55 ± 0.01 bp) [6, 10, 11]. In the present analysis, the 10-bp periodicity was prevalent in hyperthermophilic bacteria and archaebacteria (Figure 1a, b), and the 11-bp periodicity was prevalent in eubacteria. These results are consistent with those of other reported spectral analyses [12, 13].
If sequence periodicities reflect the characteristic superhelical densities of genomic DNAs, the differences in periodicity between hyperthermophilic bacteria, archaebacteria, and eubacteria may be explained as follows. Archaeal histones are structurally similar to eukaryotic core histones; that is, eukaryotic and archaeal DNAs are packed as nucleosomes in negatively constrained supercoils [14]. Periodicities in genomes, such as the 10-bp periodic occurrence of AA, are known to cause curvature of the DNA. It is therefore reasonable to think that genome sequences in eukarya and archaea are organized and stabilized by interactions between histones and nucleotides, and that the 10-bp periodicity contributes to this nucleosome organization. The observed 11-bp periodicity is consistent with the negative supercoiling reported for bacterial DNAs [10, 15]. The 10-bp periodicity was observed in all eukaryotes examined, and it was more prevalent in C. elegans than in S. cerevisiae and A. thaliana (Figure 1c).


Figure 1. Power spectra of genomes for eubacteria (a), archaea (b), and eukarya (c) in the short-periodicity region. All bacterial genomes have a 3-bp periodicity (corresponding to frequency f = 1/3), and most bacterial genomes have a 10-11 bp periodicity (f ≈ 1/10-1/11).

3.2 Periodicities in C. elegans genome

We tried to distinguish tandem repeats from interspersed repeats in the genomic sequence. Our method discloses characteristic elements that were not detected by homology searches because of their relatively low levels of sequence homology. The power spectra for all C. elegans chromosomes [16] are shown in Figure 2. Each chromosome is divided into subsequences of 2^22 bp (approximately 4.2 Mb) along the DNA strand registered in GenBank, with a moving step size of 2.1 Mb. Short repeats, such as tandem repeats, in eukaryotic genomes have been identified in a previous study [17]. Although there are many peaks in regions with frequencies higher than 2 × 10^-2 (i.e., periodicity smaller than 50 bp), we focused in the present study on several distinct periodicities found in regions with frequencies smaller than 2 × 10^-2 (i.e., periodicity larger than 50 bp) that were not characterized previously. These included a 68-bp periodicity in chromosome I, a 59-bp periodicity in chromosome II, and a 94-bp periodicity in chromosome III (shown in Figure 2).
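The windowing scheme and the conversion between spectral index and period can be sketched as follows. The helper names are hypothetical, the step is taken as 2^21 bp (about 2.1 Mb), and dropping a final partial window is an assumption, since the text does not say how chromosome ends were handled.

```python
WINDOW = 2 ** 22  # 4,194,304 bp, approximately 4.2 Mb
STEP = 2 ** 21    # 2,097,152 bp, approximately 2.1 Mb (half-window overlap)


def window_starts(chrom_len, size=WINDOW, step=STEP):
    """Start offsets of overlapping windows that lie entirely inside the chromosome."""
    if chrom_len < size:
        return []
    return list(range(0, chrom_len - size + 1, step))


def nearest_index(period_bp, n=WINDOW):
    """Spectral index j whose frequency f_j = j / n is closest to 1 / period_bp."""
    return round(n / period_bp)


def period_of(j, n=WINDOW):
    """Period in bp corresponding to spectral index j."""
    return n / j


# 15 Mb is a made-up chromosome length used only to show the windowing.
starts = window_starts(15_000_000)
print(len(starts), "windows, first starts:", starts[:3])
for p in (68, 59, 94, 50):
    j = nearest_index(p)
    print(p, "bp -> index", j, "-> frequency", j / WINDOW)
```

A 50-bp period maps to a frequency of 0.02 = 2 × 10^-2, which is the boundary used in the text to separate short repeats from the longer periodicities examined here.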
To relate these periodicities to nucleotide sequences, we examined the genomic distribution of nucleotide pairs N1 and N2 separated by k bp using the parameter Fk(N1,N1). The distributions of the 68-, 59-, and 94-bp periodicities in a 10-kb window along each chromosome are shown in Figure 3. We found eleven regions, designated by ID numbers CE1 to CE11, with Fk(N1,N1) higher than 1.5 (Figure 3). Table 1 lists the consensus sequences comprising the individual periodicities. The 68-bp periodicity was found in four regions of chromosome I (CE1, 1.34-1.35 Mb; CE2, 8.53-8.55 Mb; CE3, 12.32-12.33 Mb; CE4, 14.86-14.87 Mb). Interestingly, the consensus sequences were not necessarily similar between individual regions even when the periods were identical. For example, the consensus sequence of CE2, which is a cluster composed of 219 copies of the 68-bp periodicity, is similar to that of CE3 but very different from those of CE1 and CE4 (Table 1). It should be noted that CE4 contained as a core element a 12-bp sequence, CeRep45 (TTGGTTGAGGCT), that was characterized previously [18]. We also found a 59-bp periodicity in three regions (CE5 to CE7) of chromosome II, a 94-bp periodicity in three regions (CE8 to CE10) of chromosome III, and a 94-bp periodicity in one region (CE11) of chromosome IV.
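The region scan described above can be sketched as a loop over 10-kb windows that flags those with Fk(N1,N1) above the 1.5 threshold. The function names, the use of non-overlapping windows, and taking the maximum over the four bases are assumptions; the text does not state how the four Fk(N1,N1) curves were combined. The test sequence is synthetic.

```python
import random


def fk_same(window, base, k):
    """F_k(N1,N1) for a single base within one window."""
    n = len(window)
    f = window.count(base) / n
    if f == 0 or n <= k:
        return 0.0
    hits = sum(1 for i in range(n - k)
               if window[i] == base and window[i + k] == base)
    return (hits / (n - k)) / (f * f)


def scan(seq, k, window=10_000, threshold=1.5):
    """Return (start, end, score) for every window whose best base exceeds the threshold."""
    flagged = []
    for start in range(0, len(seq) - window + 1, window):
        w = seq[start:start + window]
        score = max(fk_same(w, b, k) for b in "ATGC")
        if score > threshold:
            flagged.append((start, start + window, round(score, 2)))
    return flagged


def random_dna(rng, n):
    return "".join(rng.choice("ATGC") for _ in range(n))


# Synthetic sequence: random flanks around 10 kb of a tandemly repeated 68-bp unit.
rng = random.Random(0)
unit = random_dna(rng, 68)
seq = random_dna(rng, 10_000) + (unit * 148)[:10_000] + random_dna(rng, 10_000)
for start, end, score in scan(seq, k=68):
    print(f"{start}-{end}: F68 = {score}")
```

Only the middle window is reported, because in random flanks the ratio stays close to 1 while in a tandem array it approaches 1/f for each base.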
Chromosome-distinct periodic segments of 11-16 bp in length have been reported [18]. These sequences were found primarily near telomeres and were predicted to be responsible for meiotic pairing. The three periodicities (CE2, CE7, and CE9) found in this study are far larger than those reported previously [17], and they are distributed along the chromosomes. Several are located near the centers of chromosomes. C. elegans has holocentric rather than monocentric chromosomes: diffuse kinetochores are formed along the entire length of each chromosome [19], and clear centromeric sequences are lacking. Although it is unclear whether the periodic sequences observed in this study are related to centromeric function, the strategy proposed here may have the power to uncover hidden periodic sequences, if present, that are related to centromeric function.


Figure 2. Power spectra (in log-log scale) of C. elegans chromosomes. Each genomic DNA sequence is divided into subsequences of length 2^22 bp (approximately 4.2 Mb) with a step size of 2.1 Mb. Circles indicate locations of evident periodicity length in the genomic sequences.

Table 1. Periodic elements in C. elegans.
ID | Chr | Region (Mb) | Period (bp) | Number of consensus core sequences | Consensus core sequence
CE1 | I | 1.34-1.35 | 68 | 6 | TTGCTGATCTCGGTAAATATGCCAAATTTCCCGTTTGCCGACATCGGCAAATTTGCGGAATTCGCCGT
CE2 | I | 8.53-8.54 | 68 | 219 | TTTGTGTTTTCTTTCTGAAATTCTAAGAATTTTGGTAAAAGAAAACCATTGTCAACTGAATAGGTTGA
CE3 | I | 12.32-12.33 | 68 | 29 | TTTGTGTTTTCTTTCTGAAATTCTAAGAATTTTGTTAAAAGAAAACCATTGTCAACTGAATAGGTTGA
CE4 | I | 14.86-14.87 | 68 | 17 | TTAATTTTGGTTGAGGCTAACACACTACAAACTACAACATTTTCTAGCCTCAACCAATTAAAAAAAAA
CE5 | II | 0.68-0.69 | 59 | 76 | GGTGAGACCCATCGCGGTGAGACCCATCGTGACGAGACCTTTCGTGGTGAGACCCATCGT
CE6 | II | 1.02-1.03 | 59 | 194 | TTCGTGGTGAGACCC
CE7 | II | 10.19-10.21 | 59 | 297 | TTTGAAAACCAGTGCACAATTGAAACTCCATATTCTCAATAATTCTCAGTTTAAAAAAA
CE8 | III | 5.38-5.39 | 94 | | (none)
CE9 | III | 6.22-6.27 | 94 | 329 | TTTTCCCATTGATTTGTCTACAAAGGGCATCGAAAAGCACCCAATATTTAGAGAACAGAAGATTTTGAGAATTACTGCCTCCAGAAATTGATGA
CE10 | III | 10.54-10.55 | 94 | 151 | TTTGCGGTTTGC
CE11 | IV | 3.16-3.17 | 94 | 121 | TTCATCTAATGGTCTAACTTTGGAAA


Figure 3. Periodic nucleotide distributions based on Fk(N1,N1) values with 10-kb window in C. elegans chromosomes. Eleven regions with Fk(N1,N1) values higher than 1.5 are designated by ID numbers CE1 to CE11 (corresponding to Table 1). The total length of each chromosome is normalized to 1.0.

3.3 Periodicities in A. thaliana genome

For comparison, we analyzed the periodicities of the A. thaliana genome [20 - 24]. Figure 4 shows power spectra for the five chromosomes. The short periodicities are consistent with those in the C. elegans genome and in other reported spectral analyses [12, 13]. Thus, the power spectrum method is useful for detecting periodic structure along genomes. We also found that chromosome 3 contained three apparent peaks at the center of A. thaliana genome and observed that chromosomes 4 and 5 have many sharp peaks. Here, we focused on several distinct periodicities found in regions with frequencies smaller than 2 × 10^-2 (i.e., periodicity larger than 50 bp) that were not characterized previously. Remarkable periodicities are as follows: three periodicities (248, 167, and 126 bp) in chromosome 3, three periodicities (174, 88, and 59 bp) in chromosome 4, and four periodicities (356, 174, 88, and 59 bp) in chromosome 5 (see Figure 4).
To relate these periodicities to nucleotide sequences, we examined the genomic distribution of periodic nucleotide sequences using the parameters Fk(N1,N1) and Fk(NS1,NS1). Figure 5 shows the distributions of each periodicity in a 10-kb window along each chromosome. In this figure, remarkable peaks were obtained for k = 126 and 174; these are designated AT1 to AT9. Table 2 lists the core sequences comprising the periodic structure at k-bp distance with Fk(N1,N1) or Fk(NS1,NS1) higher than 2, that is, sequences whose frequencies are more than two times higher than those estimated statistically. In region AT1, GGN-type sequences occur periodically, and an ORF included in this region is rich in Gly, whose codons correspond to GGN. Thus, the core sequences comprising the periodic structure reflect the amino acid composition of the ORF. The periodic sequences that reflect amino acid composition obtained from the present analysis are listed in Table 3. A common sequence, SPPPPYVYSSPPPPYYS, is found in six regions: AT2, AT3, AT5, AT6, AT8, and AT9.
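The link between periodic core sequences and amino acid composition can be illustrated by translating a repeat in frame with the standard genetic code. This is a sketch only: the input strings are made-up repeats, not sequences from the regions in Table 3, and the function names are hypothetical.

```python
from collections import Counter

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def translate(dna):
    """Translate in frame 0; raises ValueError on characters other than T, C, A, G."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        a, b, c = (BASES.index(x) for x in dna[i:i + 3])
        protein.append(AMINO[16 * a + 4 * b + c])
    return "".join(protein)


def composition(dna):
    """Amino acid counts of the in-frame translation."""
    return Counter(translate(dna))


# Made-up GGN-type repeat: every codon encodes Gly.
print(translate("GGTGGCGGAGGG"), composition("GGTGGCGGAGGG" * 3))
# Made-up TCT + CCA repeat: encodes the SPPPP motif.
print(translate("TCTCCACCACCACCA"))
```

Any GGN codon translates to Gly, so a region in which GGN recurs in frame yields a Gly-rich ORF, which is the relation described for AT1.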


Figure 4. Power spectra of A. thaliana complete genome for short-range periodicities in log-log scale: (a) chromosome 1 (ACCESSION #: NC_003070), (b) chromosome 2 (NC_003071), (c) chromosome 3 (NC_003072), (d) chromosome 4 (NC_003073), and (e) chromosome 5 (NC_003074). The spectra of A resemble those of T, while spectra of G curves are similar to those of C.

Table 2. Core sequences consisting of 2 or 3 nucleotides with Fk(NS1,NS1) higher than 2.
AT1 F126    AT2 F126    AT3 F126    AT4 F126    AT5 F126
GG 4.7    GG 19.2    AA 3.7    AAG 78.9    TA 7.4
TGG 726.6    GTA 174.3    AT 11.3    AAA 9.4    CC 3.0
TTT 29.4    GAA 74.5    AGA 57.0    TAC 318.7
GGC 3893.9    TGG 455.6    TAA 21.6    CCA 533.2
GGT 363.3    TTT 29.4    TTT 58.4
GTA 223.9    TGG 1527.3
GTA 239.9
AT6 F174    AT7 F174    AT8 F174    AT9 F174
TGG 672.2    TT 16.8    GG 2.8    AA 2.8
TTT 53.3    TA 2.7    AT 8.7    AT 13.9
AAA 54.8    CCA 290.4    TGG 736.7    TGG 909.5
AAA 32.5    AAA 47.9    TTT 32.3
CACGTA 109.9    GTA 124.2
AAA 56.2
ATT 31.6


Figure 5. Periodic nucleotide distributions based on Fk(N1,N1) values with 10-kb window in A. thaliana: (a) chromosome 3, (b) chromosome 4, (c) chromosome 5. Nine regions with Fk(N1,N1) values higher than 1.4 are designated by ID numbers AT1 to AT9.

Table 3. Relation between periodicities and amino acid sequences in ORFs.
ID | chr. | aa-sequence characteristics | product protein [gene name]
AT1 | 3 | Gly-rich | hypothetical [At3g23450]
AT2 | 3 | SPPPPYVYSSPPPPYYS-repetitive | unknown [At3g28550]
AT3 | 3 | SPPPPYVYSSPPPPYYS-repetitive | extensin precursor-like [At3g54580]
AT4 | 3 | Gly, Ser, Ala-rich | histone-H4-like [At3g28780]
AT5 | 3 | SPPPPYVYSSPPPPYYS-repetitive | extensin precursor-like [At3g54590]
AT6 | 4 | SPPPPYVYSSPPPPYYS-repetitive | extensin-like [At4g08410]
AT7 | 4 | Not found | hypothetical protein [At4g01980]
AT8 | 5 | SPPPPYVYSSPPPPYYS-repetitive | putative [At5g06640]
AT9 | 5 | SPPPPYVYSSPPPPYYS-repetitive | putative [At5g49080]

3.4 Periodicities in D. melanogaster genome

Since D. melanogaster is one of the most widely investigated organisms in biology, there is interest in the periodic structures of its genome. Figure 6 shows the power spectra for all D. melanogaster genomic DNA sequences [25] at short-range periodicity. All chromosomes have a 3-bp periodicity (frequency f = 1/3) and a 10-bp periodicity (f ≈ 1/10). We also found that all chromosomes have a 5-bp periodicity (f ≈ 1/5), as indicated by arrows. Interestingly, broad weak peaks exist over the entire genome and are especially marked for A and T.
Figure 7 shows periodic nucleotide distributions based on Fk(N1,N1) values with a 10-kb window in the D. melanogaster X chromosome. Period sizes are k = 3 bp, 4 bp, and 5 bp. Here, if occurrences of the nucleotide N1 at distance k are statistically independent, fk(N1,N1) = f(N1) . f(N1) and the Fk(N1,N1) value is 1.0. In the G and C curves, the highest average is obtained for the 3-bp period, and the others also have Fk(N1,N1) higher than 1.0 (the value expected for a random sequence). We have already attributed this 3-bp periodicity to codon structure. Trinucleotide repeat stretches in Drosophila tend to be longer than other repeats [17]. In contrast, the A and T curves have their lowest values for the 3-bp period; that is, 4-bp and 5-bp periodicities of A and T are widely distributed in the fly genome. This suggests that 4- and/or 5-mers have specific structures in the D. melanogaster genome.


Figure 6. Power spectra of D. melanogaster genomic DNA at short-range periodicity in log-log plot: (a) chromosome X (ACCESSION #: AE002566), (b) chromosome X (AE002593), (c) chromosome 2L (AE002690), (d) chromosome 2R (AE002787), (e) chromosome 3L (AE002602), and (f) chromosome 3R (AE002708).


Figure 7. Periodic nucleotide distributions based on Fk(N1,N1) values with 10-kb window in D. melanogaster X chromosome. Period sizes are 3 bp, 4 bp, and 5 bp.

3.5 Periodicity in human chromosomes 21 and 22

One interesting question for biologists is how the human genome differs from those of other species. From the viewpoint of periodicity, it is therefore important to examine the H. sapiens genome. Power spectra of human chromosomes 21 and 22 are shown in Figure 8. We found two broad peaks centered at the 167- and 84-bp periodicities across the entire lengths of both chromosomes. Interestingly, the 167-bp periodicity is identical to the length of DNA that forms two complete helical turns in one nucleosome with H1 histone [26]. It is possible that the respective sequences form contiguous arrays of a specific compact form of nucleosome. The distributions of the 84- and 167-bp periodicities, visualized as the periodic nucleotide density for a 10-kb window on human chromosomes 21 and 22, are shown in Figure 9. These periodicities are present across entire chromosomes, because the baselines of these distributions along the chromosomes are shifted to a level clearly higher than 1. The core elements corresponding to evident peaks (shown in Figure 9) contain a high frequency of TGG (Table 4). For example, in the case of 42 copies of a 167-bp periodic element clustered in the 3.49-3.50 Mb region of chromosome 22, each element was composed of TGG-containing sequences such as GGCTGG, CTGGCT, and GCTGGC when represented by hexanucleotides (Table 4). In chromosome 22, regions with high frequencies of TGG were also observed near the centromere, e.g., at 0.39 to 0.40 Mb (HS2, a cluster of 317 copies of 84-bp elements) and at 3.49 to 3.50 Mb (HS3; see Table 4 for the periodic elements). On chromosome 21, a cluster of the same core elements occurred in the region near the telomere (HS1). TGG-rich sequences are known to form a specific subset of folded DNA structures and to be associated with the self-assembly phenomenon [27, 28]. DNA sequences with TGG-core elements may form specific higher-order structures related to the clustered occurrence of a specific form of the nucleosomes hypothesized above.
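A hexanucleotide representation of a periodic element, in the spirit of Table 4, can be sketched by counting hexamers that recur exactly k bp downstream. Treating a "pair" as the same hexamer at positions i and i + k is an assumption about how the counts were obtained, and the function name and input sequence are made up for illustration.

```python
from collections import Counter


def periodic_hexamers(seq, k, word_len=6, min_pairs=1):
    """Count words that occur at position i and again at position i + k."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k - word_len + 1):
        word = seq[i:i + word_len]
        if word == seq[i + k:i + k + word_len]:
            counts[word] += 1
    return Counter({w: c for w, c in counts.items() if c >= min_pairs})


# Made-up array: five copies of an 84-bp unit that starts with TGGCTG.
unit = "TGGCTG" + "A" * 78
array = unit * 5
counts = periodic_hexamers(array, k=84)
print(counts["TGGCTG"], counts["GGCTGA"])
```

Five tandem copies give four pairs at distance 84 for each hexamer that occurs once per unit. Setting `min_pairs=21` would mimic the "> 20 pairs" cut-off in the table header.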


Figure 8. Power spectra of human chromosomes 21 and 22. In the high-frequency region, the spectral behavior differs from that of C. elegans. In the middle-frequency region, broad peaks centered at the 84- or 167-bp periodicity are observed for all subsequences of both chromosomes.


Figure 9. Periodic nucleotide distributions based on Fk(N1,N1) values with 10-kb window in human chromosomes 21 and 22. Eleven regions with Fk(N1,N1) values higher than 1.5 are designated by ID numbers HS1 to HS9 (corresponding to Table 4). The total length of each chromosome is normalized to 1.0.

Table 4. Periodic elements in H. sapiens.
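ID | Chr | Region (Mb) | Period (bp) | Consensus core (hexanucleotide) | Number of consensus cores (> 20 pairs)
HS1 | 21 | 41.54-41.55 | 84 | GTGGTG | 167
 | | | | TGGTGG | 167
 | | | | GGTGGT | 166
 | | | | TAGTGG | 97
 | | | | TGGTGA | 96
 | | | | GTGATG | 94
 | | | | ATGGTG | 92
 | | | | TGATGG | 92
 | | | | GATGGT | 91
HS2 | 22 | 0.39-0.40 | 84 | TGGTGG | 317
 | | | | GGTGGT | 309
 | | | | GTGGTG | 281
 | | | | TGATGG | 88
 | | | | GATGGT | 84
 | | | | ATGGTG | 70
 | | | | GTGATG | 74
HS3 | 22 | 3.49-3.50 | 84 | TGGCTG | 42
 | | | | GGCTGG | 38
 | | | | CTGGCT | 27
 | | | | GCTGGC | 20
HS4 | 22 | 5.63-5.64 | 84 | ATTTCA | 34
 | | | | TTTCAT | 31
 | | | | TCATTT | 30
 | | | | CATTTC | 27
HS5 | 22 | 33.82-33.83 | 84 | AATGTG | 27
HS6 | 22 | 34.29-34.30 | 84 | AATGTG | 22
HS7 | 22 | 3.49-3.50 | 167 | TGGCTG | 42
 | | | | GGCTGG | 38
 | | | | CTGGCT | 27
 | | | | GCTGGC | 20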

3.6 Spectrum landscape for long-range periodicity

The slope of the logarithm of power (log S(f)) against the logarithm of frequency (log f) has been associated with fractal properties [3, 4, 29]. Flat power spectra are associated with random sequences: when this slope is 0, the nucleotide sequence can be produced by random processes corresponding to random mutations. A slope close to -1 is the signature of fractal correlation, and a slope between 0 and -1 indicates long-range correlation [29, 30].
Power spectra at low frequencies were examined for the five chromosomes of A. thaliana (Figure 10). The slopes of the spectra for the four nucleotides behave almost identically in this range, so the relation between gene number and slope (exponent) is depicted for adenine in Figure 11. The exponent is correlated with the gene number of the respective chromosome: a chromosome with a high gene number tends to have a slope closer to -1. This indicates a relation between fractal property and gene organization of genomes. These results are consistent with those for H. sapiens reported later in this section.
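The slope discussed above can be estimated by an ordinary least-squares fit of log S(f) against log f over a chosen frequency band. The sketch below assumes a plain unweighted fit and base-10 logarithms; the function name and the synthetic spectra are illustrative and are not the fitting procedure used for the figures.

```python
import math


def loglog_slope(freqs, powers, f_min, f_max):
    """Least-squares slope of log10 S(f) versus log10 f for f_min <= f <= f_max."""
    pts = [(math.log10(f), math.log10(s))
           for f, s in zip(freqs, powers)
           if f > 0 and s > 0 and f_min <= f <= f_max]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two points in the band")
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    return sxy / sxx


# Synthetic spectra (not genomic data) on frequencies f_j = j / N.
N = 1024
freqs = [j / N for j in range(1, N // 2 + 1)]
one_over_f = [1.0 / f for f in freqs]  # "1/f" spectrum
flat = [1.0 for _ in freqs]            # uncorrelated, random-like spectrum
print(loglog_slope(freqs, one_over_f, 1e-3, 0.5))
print(loglog_slope(freqs, flat, 1e-3, 0.5))
```

The first fit returns a slope of -1 and the second a slope of 0, up to rounding. Restricting `f_min` and `f_max` to different bands gives separate slopes for different ranges of period.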


Figure 10. Power spectra of A. thaliana genome for long-range periodicities in log-log scale: (a) chromosome 1, (b) chromosome 2, (c) chromosome 3, (d) chromosome 4, and (e) chromosome 5. For all chromosomes the slopes of the spectra have very similar behavior in this range.


Figure 11. The relation between gene number and exponent (the slope of the logarithm of power (log S(f)) to the logarithm of frequency (log(f))) for adenine in A. thaliana chromosomes. The exponent is correlated with the gene number of the respective chromosome (correlation coefficient -0.73). The slopes of other nucleotide spectra resemble each other.

For long periodicities, the power spectra of D. melanogaster and A. gambiae [45] genomic DNA sequences in log-log plots are shown in Figures 12 and 13, respectively. The spectra of A resemble those of T, while the spectra of G are similar to those of C. Interestingly, the G and C spectral curves have a flat region in the middle frequency range from f = 10^-4 to 10^-5 (corresponding to period size 1 kb-5 kb) in the fly, while the mosquito genome has two flat regions. Such DNA sequence correlations are called "partial power-law": for high frequencies the power spectrum is roughly flat, while at low frequencies it shows power-law decay (f^-β), that is, a log-log slope of approximately -β. "1/f" noise (β = 1) over a given frequency range indicates a fractal structure over the corresponding range of wavelengths. A recent study of DNA sequences revealed that the power spectrum as a function of frequency shows three different regions on a logarithmic scale: the spectrum changes from a flat region to a power-law region, and then becomes flat again [42]. A flat power spectrum means a lack of correlation, as in random sequences. Such flat power spectra at middle frequencies have not been observed in the other eukaryotes, such as S. cerevisiae, C. elegans, A. thaliana, and H. sapiens, or in the prokaryotes analyzed [6]. Taking these points into consideration, our results for the D. melanogaster genome are important findings for genome architecture [44]. We emphasize that this property should help us to understand the origin and the evolution of the genome. One possible origin of this middle flattened region may be related to puffs in D. melanogaster chromosomes [43].


Figure 12. Power spectra of D. melanogaster genomic DNA sequences at low frequencies in log-log plot: (a) chromosome 2R (ACCESSION #: AE002787), (b) chromosome 2L (AE002690), (c) chromosome 3R (AE002708), and (d) chromosome 3L (AE002602). The spectra of A resemble those of T, while G curves are similar to C. Interestingly, G or C spectral curves have flat regions in the range from f = 10^-4 to 10^-5 (corresponding to cyclic size 1 kb-5 kb).


Figure 13. Power spectra of A. gambiae genomic DNA sequences at low frequencies in log-log scale: (a) chromosome X (ACCESSION #: AAAB01008807), (b) chromosome 2 (AAAB01008987), (c) chromosome 2 (AAAB01008960), and (d) chromosome 3 (AAAB01008984). The spectra of A resemble those of T, while G curves are similar to C. Interestingly, G or C spectral curves have two flat regions at range corresponding to cyclic size from 0.1 kb to 0.5 kb and from 1.6 kb to 10.0 kb.


Figure 14. (a) Representative power spectra of human chromosomes and definition of slopes a and b. Slope a is defined for the region with periodicity larger than 10^5 bp (frequency < 10^-5) and slope b for the region of 10^4 to 10^5 bp periodicity. Power spectra are drawn for human chromosomes 4 and 22. (b) The relations between genome GC% and slopes (a and b) for all human chromosomes; human draft genomic sequences compiled by GenBank for individual chromosomes were analyzed. All chromosomes have rather similar a slopes close to -1 that are independent of GC%. In contrast, the b slope is highly correlated with the GC% of the respective chromosome.

The human genome has heterogeneous properties that appear to be characterized by two distinct slopes, designated a for the region with periodicity larger than 10^5 bp (frequency < 10^-5) and b for the region of 10^4 to 10^5 bp periodicity (Figure 14a). While GC composition is known to be homogeneous within the genomes of most prokaryotes and unicellular eukaryotes, the genomes of higher vertebrates have a mosaic GC% structure, the "isochore" [31 - 34], that appears to be related to replication timing [35 - 37]. This complexity may be reflected in the heterogeneous nature of the slopes observed for individual human chromosomes. The relations between GC% and the slopes (a and b) of each human chromosome are shown in Figure 14b. Although all chromosomes had similar a slopes close to -1 regardless of GC%, the b slope observed in the range from 10 to 100 kb was clearly correlated with the GC% of each chromosome. The range from 10 to 100 kb is roughly the size of many genes. In the human genome, GC% is known to be related to gene density. For example, human chromosomes 19 and 22 have high GC% (49% and 48%, respectively) and high gene density (23 and 17 genes/Mb, respectively); conversely, chromosomes 4 and 13 have low GC% (both 38%) and low gene density (6 and 5 genes/Mb, respectively) [38, 39]. Chromosomes with a high GC% and high gene density tend to have a b slope closer to -1, which may indicate a relationship with the fractal structure of the chromosomes. Such fractal structures may reflect the highly variegated landscape of GC-poor and GC-rich isochores typical of these chromosomes [40, 41] and the gene organization (and presumably exon and intron organization) in these chromosomes.

This work was supported by a Grant-in-Aid for scientific research on priority areas from Mombukagakusho (Ministry of Education, Science, Sports and Culture of Japan) and JST (Japan Science and Technology).

References

[1] W. Li, Computers Chem., 21, 257-271 (1997).
[2] S. V. Buldyrev, N. V. Dokholyan, A. L. Goldberger, S. Havlin, C.-K. Peng, H. E. Stanley, G. M. Viswanathan, Physica A, 249, 430-438 (1998).
[3] R. F. Voss, Phys. Rev. Lett., 68, 3805-3808 (1992).
[4] W. Li, K. Kaneko, Europhys. Lett., 17, 655-660 (1992).
[5] W. Li, G. Stolovitzky, P. Bernaola-Galvan, J. L. Oliver, Genome Res., 8, 916-918 (1998).
[6] A. Fukushima, T. Ikemura, M. Kinouchi, T. Oshima, Y. Kudo, H. Mori, S. Kanaya, Gene, 300, 203-211 (2002).
[7] J. C. Shepherd, J. Mol. Evol., 17, 94-102 (1981).
[8] J. W. Fickett, Nucleic Acids Res., 10, 5303-5318 (1982).
[9] R. Staden, Methods Enzymol., 183, 163-180 (1990).
[10] E. N. Trifonov, J. L. Sussman, Proc. Natl. Acad. Sci. USA, 77, 3816-3820 (1980).
[11] M. Tomita, M. Wada, Y. Kawashima, J. Mol. Evol., 49, 182-192 (1999).
[12] E. N. Trifonov, Physica A, 249, 511-516 (1998).
[13] H. Herzel, O. Weiss, E. N. Trifonov, Bioinformatics (CABIOS), 15, 187-193 (1999).
[14] K. Sandman, J. N. Reeve, Arch. Microbiol., 173, 165-169 (2000).
[15] A. Vologodsky, Topology and Physics of Circular DNA, CRC Press, Boca Raton (1992).
[16] C. elegans Sequencing Consortium, Genome sequence of the nematode C. elegans: A platform for investigating biology, Science, 282, 2012-2018 (1998).
[17] M. V. Katti, P. K. Ranjekar, V. S. Gupta, Mol. Biol. Evol., 18, 1161-1167 (2001).
[18] C. Sanford, M. D. Perry, Nucleic Acids Res., 29, 2920-2926 (2001).
[19] D. E. Comings, T. A. Okada, Chromosoma, 37(2), 177-192 (1972).
[20] A. Theologis, J. R. Ecker, C. J. Palm et al., Nature, 408, 816-819 (2000).
[21] X. Lin, S. Kaul, S. Rounsley et al., Nature, 402, 761-768 (1999).
[22] European Union Chromosome 3 Arabidopsis Sequencing Consortium, The Institute for Genomic Research & Kazusa DNA Research Institute, Nature, 408, 820-822 (2000).
[23] The European Union Arabidopsis Genome Sequencing Consortium & The Cold Spring Harbor, Washington University in St Louis and PE Biosystems Arabidopsis Sequencing Consortium, Nature, 402, 769-777 (1999).
[24] The Kazusa DNA Research Institute, The Cold Spring Harbor and Washington University in St Louis Sequencing Consortium & The European Union Arabidopsis Genome Sequencing Consortium, Nature, 408, 823-826 (2000).
[25] M. D. Adams, S. E. Celniker, R. A. Holt et al., Science, 287, 2185-2195 (2000).
[26] R. R. Sinden, DNA structure and function, Academic Press, Inc., San Diego (1994).
[27] F. M. Chen, Biophys. J., 73, 348-356 (1997).
[28] K. Usdin, Nucleic Acids Res., 26, 4078-4085 (1998).
[29] W. Li, Phys. Rev. A, 43, 5240-5260 (1991).
[30] W. Li, Int. J. Bifurcation and Chaos, 2, 137-154 (1992).
[31] G. Bernardi, B. Olofsson, J. Filipski, M. Zerial, J. Salinas, G. Cuny, M. Meunier-Rotival, F. Rodier, Science, 228, 953-958 (1985).
[32] G. Bernardi, Annu. Rev. Genet., 23, 637-661 (1989).
[33] T. Ikemura, Mol. Biol. Evol., 2, 13-34 (1985).
[34] T. Ikemura, S. Aota, J. Mol. Biol., 203, 1-13 (1988).
[35] G. P. Holmquist, J. Mol. Evol., 28, 469-486 (1989).
[36] G. Bernardi, Gene, 241, 3-17 (2000).
[37] Y. Watanabe, A. Fujiyama, Y. Ichiba, M. Hattori, T. Yada, Y. Sakaki, T. Ikemura, Hum. Mol. Genet., 11, 3-21 (2002).
[38] E. S. Lander, L. M. Linton, B. Birren et al., Nature, 409, 860-921 (2001).
[39] J. C. Venter, M. D. Adams, E. W. Myers et al., Science, 291, 1304-1351 (2001).
[40] A. Pavlicek, K. Jabbari, J. Paces, V. Paces, J. Hejnar, G. Bernardi, Gene, 276, 39-45 (2001).
[41] J. L. Oliver, P. Bernaola-Galvan, P. Carpena, R. Roman-Roldan, Gene, 276, 47-56 (2001).
[42] M. S. Vieira, Phys. Rev. E, 60, 5932-5937 (1999).
[43] D. M. Gilbert, Science, 294, 96-100 (2001).
[44] A. Fukushima, T. Ikemura, T. Oshima, H. Mori, S. Kanaya, Genome Informatics, No.13, 21-29 (2002).
[45] R. A. Holt, G. M. Subramanian, A. Halpern et al., Science, 298, 129-149 (2002).

