
Figure 1. Parameters and variables used for the Monte-Carlo simulation. A genome sequence of *N* nucleotides in length was allotted to an array, *A(i)*, of *N* elements on computer. A piece of snow (*i.e.*, a sequence read of *n* nucleotides) was made to fall on a site spanning from *i* to *i*+*n*-1, and the values of the corresponding array elements were each incremented by 1. The site *i* was determined by a computer-generated random number. A contig is defined as a stretch of array elements holding non-zero values, while a gap is a continuum of array elements holding zero values. Site *redundancy* is defined as the value of an array element at a particular site, while *redundancy*, *r*, is defined as Σ_{i} *A(i)*/*N*. *Coverage*, k, is defined as (the total number of array elements holding non-zero values)/*N*.

The key parameters employed in composing the program are the *genome size* (*N*), *fragment size* (*n*), *redundancy* (*r*), *coverage* (k), and *the number of contigs* (*j*), which are depicted in Figure 1. Genome DNA, whether a single chromosome or plural (or fragmented) chromosomes, was treated as a single linear molecule (called the *genome sequence*) by notionally ligating those fragments into one. (Although this approximation introduces a minor problem caused by ligating chromosomes and the resulting vanishing of ends, it is negligible so long as *n* is substantially small, say, less than a hundredth of *N*.) Each nucleotide of the genome sequence (the *i*-th nucleotide) is represented by an array element (*A(i)*; *A* has *N* dimensions) in computation. In the program (*Yuki-com*^{2}), a random number *x* (0 <= *x* <= 1) with sufficiently many significant figures is generated to determine the site for a sequence read; that is, the starting point of the sequence (*P*_{s}) is calculated as

*P*_{s} = INT(*x*·(*N*-*n*+1)) + 1 (Eq. 1)

where INT() denotes the integer part of its argument.
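The simulation loop described above can be sketched as follows. This is an illustrative reimplementation, not the original *Yuki-com* code: reads are placed uniformly at random wholly within the array (the end effect is neglected, as stated in the text), and the scores *r*, k, and *j* are computed exactly as defined in Figure 1.

```python
import random

def snowfall(N, n, reads, seed=0):
    """Drop `reads` fragments of length n onto a genome of N sites and
    return (redundancy r, coverage k, number of contigs j)."""
    rng = random.Random(seed)
    A = [0] * N                       # A(i): times site i has been sequenced
    for _ in range(reads):
        p = rng.randrange(N - n + 1)  # start site P_s, uniform random
        for i in range(p, p + n):
            A[i] += 1
    covered = sum(1 for a in A if a > 0)
    r = sum(A) / N                    # redundancy: mean site redundancy
    k = covered / N                   # coverage: fraction of non-zero sites
    # a contig is a maximal run of non-zero array elements
    j = sum(1 for i in range(N) if A[i] > 0 and (i == 0 or A[i - 1] == 0))
    return r, k, j
```

Since every read contributes exactly *n* to the array total, *r* equals (reads × *n*)/*N* exactly, while k and *j* fluctuate from run to run, which is what the repeated trials average out.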

Simulations were repeated, usually 1,000-7,000 times, until the standard deviation of the resulting scores became sufficiently small.

Therefore, this program (*Yuki-com*) is, by its nature, usable for many phenomena which conform to a Poisson distribution.

Figure 2. A plot of *redundancy* versus *coverage*. The abscissa is a processed value of *coverage*, 1-k. The ordinate is *redundancy*. The symbols express the combinations of parameters (genome size *N* and size of sequence read *n*): (*N* = 5.2×10^{7}, *n* = 400), (5.2×10^{7}, 100), (6.5×10^{6}, 400), (6.5×10^{6}, 100), (6.5×10^{6}, 4), and (6.5×10^{5}, 400), respectively. The thick line is a regression line for the relevant points, from which Eq. 3 was obtained, while the other thin lines are polygonal lines.

The upper-limit dependence of *r* on 1-k shown in Figure 2 can be formulated as follows:

*r* = -ln(1-k) (Eq. 3)

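This upper limit is just the Poisson expectation: when reads are placed uniformly at random, the chance that a given site remains uncovered at *redundancy* *r* is e^{-*r*}, so the uncovered fraction is 1-k = e^{-*r*}. A quick numerical check of the resulting relation (an illustrative helper, not part of the original program):

```python
import math

def redundancy_for_coverage(k):
    """Redundancy required to reach coverage k under Poisson statistics:
    the uncovered fraction is 1 - k = exp(-r), hence r = -ln(1 - k)."""
    return -math.log(1.0 - k)
```

For example, reaching 99% coverage (k = 0.99) requires *r* = -ln(0.01) ≈ 4.6, consistent with the "at most 5" figure discussed below for the human genome.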
Equation 3 is useful for estimating the optimal value of k at which a methodological shift should be made from the random process of shotgun to a directed process such as PCR relay [7]; this value can be called the *switching point*. If we denote the total cost of whole genome sequencing by *C*, and the costs per sequence read of the shotgun process (which continues from the beginning to the switching point at *coverage* k) and of the directed process by *c*_{s} and *c*_{d}, respectively, *C* can be written as follows:

*C* = *c*_{s}(*N*/*l*_{s})*r*_{s} + *c*_{d}(*N*/*l*_{d})(1-k)*r*_{d} (Eq. 4)

where *l* represents the average length of the sequences read (in this simulation, no size distribution is assumed), and the subscripts *s* and *d* stand for 'shotgun' and 'directed', respectively. Substituting Eq. 3 for *r*_{s} and applying a Taylor expansion, Eq. 4 can be rearranged as follows:

*C* = *N*[(*c*_{s}/*l*_{s})(k + k^{2}/2 + k^{3}/3 + ...) + (*c*_{d}/*l*_{d})*r*_{d}(1-k)] (Eq. 5)

Therefore,

*C*' = *N*[(*c*_{s}/*l*_{s})(1 + k + k^{2} + ...) - (*c*_{d}/*l*_{d})*r*_{d}] (Eq. 6)

Since 0 <= k < 1 and all factors appearing in Eq. 6 are positive, *C*' is monotonically increasing. As *C*'(0) < 0, *C*'(k) equals 0 at one and only one value of k (k_{min}), where *C* takes its minimum (*C*_{min}). k_{min} and *C*_{min} can be obtained numerically using Eqs. 5 and 6. If *c*_{d}/*c*_{s} = 3, *l*_{s} = *l*_{d} (= 400), and *r*_{d} = 2.2, then k_{min} = 0.84 (where *r*_{s} = 1.8 from Eq. 3). If *c*_{d}/*c*_{s} = 10 and the rest are the same as above, k_{min} = 0.95 (*r*_{s} = 2.9). This means that, depending on the cost of sequencing per nucleotide (*c*/*l*) for the two processes, shotgun and directed, we can optimize the total scheme (although we think an ongoing genome sequencing project must be operated more dynamically because of incessant technical improvements, so the estimation performed here remains tentative). Eq. 3 thus gives us an optimal (lowest-cost or shortest-term) solution for planning a shotgun-based, directed-process-complemented genome sequencing.
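The switching point can be reproduced numerically. The sketch below assumes the cost model described above (per-read costs *c*_{s} and *c*_{d}, read lengths *l*_{s} and *l*_{d}, with the shotgun redundancy given by Eq. 3); setting the derivative of the total cost to zero gives 1-k_{min} = (*c*_{s}*l*_{d})/(*c*_{d}*l*_{s}*r*_{d}):

```python
import math

def switching_point(cd_over_cs, r_d, l_s=400.0, l_d=400.0):
    """Coverage k_min at which to switch from shotgun to a directed process.

    Minimizes C(k)/N = (c_s/l_s) * (-ln(1-k)) + (c_d/l_d) * r_d * (1-k):
    C'(k) = 0  gives  1 - k = (c_s * l_d) / (c_d * l_s * r_d).
    Returns (k_min, shotgun redundancy r_s at the switch)."""
    one_minus_k = l_d / (cd_over_cs * l_s * r_d)
    return 1.0 - one_minus_k, -math.log(one_minus_k)
```

With *c*_{d}/*c*_{s} = 3, *r*_{d} = 2.2, and *l*_{s} = *l*_{d} = 400, this yields k_{min} ≈ 0.85 (*r*_{s} ≈ 1.9); with *c*_{d}/*c*_{s} = 10, k_{min} ≈ 0.95 (*r*_{s} ≈ 3.1), matching the values quoted above up to rounding.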

Figure 3. Dependence of *redundancy* on genome size (*N*) and *coverage* (k). The genome-size dependence of *redundancy* was examined by computer simulation at various *coverage* values, k = 0.9, 0.99, 0.999, 0.9999, and 1 from bottom to top, for *n* = 100 and *n* = 400 (the two symbol types), respectively. Some of the experimental values obtained from actual genome sequencing projects (see Table 1) are plotted with the symbol ×.

Figure 2 also conveys an important fact: the *redundancy* (*r*) of a shotgun process applied to human genome sequencing is, theoretically, at most 5 at the stage of 99% *coverage* (k = 0.99) under practicable conditions. In addition, there seems to be no substantial difference in efficiency between bacterial genome (Mb-sized) and human genome (Gb-sized) sequencing, which is shown more clearly in Figure 3. Rationally, owing to this scale-invariant ('fractal') nature, Gb-sized genome sequencing is approximately equivalent to 1,000 repetitions of Mb-sized sequencing when a shotgun process is employed. The computer work does become complex and burdensome (*e.g.*, to find a partially overlapping pair of sequences, the amount of calculation grows with the square of the number of fragments, not linearly; therefore, at least in this aspect, sequencing a genome of *m*-fold length requires more than an *m*-fold amount of work), but it is already well known that the ratio of the computer work to the whole is rather small [6, 8].

Table 1 shows experimental scores of *redundancy* collected from bacterial genome sequencing, mostly performed and published by Venter and his collaborators [8-14]. On average, genomes of Mb size are sequenced at a *redundancy* of nearly 7. Some of these values are plotted in Figures 3 and 4. If we assume that the shotgun process was pursued up to 99% *coverage* in those projects, the experimental values are larger by around 2 than those of the simulation (Figure 3). This increment can be explained by two facts: a stretch of length Δ*l* (>= log_{4}*N*, which is, theoretically, the smallest length a stretch of nucleotides must have in order to be unique in a genome of size *N*) is required to declare an overlap; and there was, in reality, a size distribution in the sequence reads, which is known to increase *redundancy* slightly [5]. Throughout this simulation, such effects were neglected. Moreover, in actual experiments there are inevitable errors and artifacts which contribute greatly to increasing *redundancy*.
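The bound Δ*l* >= log_{4}*N* can be illustrated directly (a hypothetical helper solving 4^{L} >= *N* for the smallest integer L):

```python
import math

def min_unique_overlap(N):
    """Smallest stretch length L that can be unique in a genome of N
    nucleotides: 4**L >= N, i.e. L = ceil(log_4 N)."""
    return math.ceil(math.log(N) / math.log(4))
```

For an Mb-sized bacterial genome (*N* ≈ 2×10^{6}) this gives 11 nucleotides, and for the human genome (*N* ≈ 3×10^{9}) only 16; both are small compared with a 400-500 nt read, so the resulting increment of *redundancy* is modest.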

Table 1 *Redundancy* and the other scores in whole genome shotgun sequencing reported.

# | Organism | Genome size (*N*) /bp | Average read length (*n*) /nts | Redundancy (*r*) | Number of contigs (*j*) | *j*/*N* ×10^{5} | Ref. |
---|---|---|---|---|---|---|---|
1 | *Haemophilus influenzae* | 1,830,137 | 460 | 6* | - | - | [8] |
2 | *Mycoplasma genitalium* | 580,070 | 435 | 6.5 | 39 | 6.7 | [7] |
3 | *Methanococcus jannaschii* | 1.66×10^{6} | 481 | 10.6* | - | - | [9] |
4 | *Helicobacter pylori* | 1,667,867 | - | - | - | - | [10] |
5 | *Archaeoglobus fulgidus* | 2,178,400 | 500* | 6.8 | 152 | 7 | [11] |
6 | *Borrelia burgdorferi* | 910,725 + 533,000 | 505 | 7.5 | - | - | [12] |
7 | *Treponema pallidum* | 1,138,006 | 525 | ~7 | 130^{§} | >=11.4 | [13] |
8 | *Methanobacterium thermoautotrophicum* ΔH | 1,751,377 | 361 | ~8.5* | - | - | [14] |

Figure 4. Dependence of the number of contigs per nucleotide (*j/N*) on genome size (*N*). The same symbols are used here as in Figure 3. *Coverage* values: k=0.9, 0.99, 0.999, 0.9999 and 1 from top to bottom.

More significant than the absolute value of *redundancy* itself is the fact that *redundancy* shows almost no genome-size dependence, as shown in Figure 3. Thus, we can assume that the *redundancy* in human genome sequencing is not so different from that obtained from bacterial genomes (*r* = 7), which is a very favorable prediction for whole human genome shotgun sequencing. The number of contigs (*j*) per unit length is also rather independent of the genome size (Figure 4). This also supports the shotgun approach applied to human genome sequencing, since the scale of the end-closing process, which depends on the number of contigs (*j*), is then merely proportional to the genome size. The experimental values for the number of contigs per nucleotide (*j/N*) indicate that 99% *coverage* was attained in those experiments, which is consistent with the assumption introduced above. (However, it should be noted that a contig in our definition includes even a single, non-overlapped sequence, while in most of the works referred to, single sequences were omitted from contigs, so that their values will have been undercounted. The number of single sequences is, unfortunately, not given in those papers but can be estimated to be less than *j*.) The number *j* enables us to estimate the average size of gaps, *g*, by the following equation:

*g* = (1-k)*N*/*j*

Therefore, if *N* = 2,000,000, k = 0.95, and *j* = 400, then *g* = 250; if *N* = 2,000,000, k = 0.99, and *j* = 150, then *g* = 134. Evidently, in the late stage of genome sequencing, the average size of gaps (*g*) is less than the size of a single sequence (*n*; ~400), which agrees well with the reported observation (*g* less than 300 bp) [8].
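The two worked examples can be checked against this relation, taking the average gap size as the uncovered fraction (1-k)*N* shared among roughly *j* gaps (the second value comes to ~133, which the text rounds to 134):

```python
def mean_gap_size(N, k, j):
    """Average gap size g: the (1-k)*N uncovered nucleotides divided among ~j gaps."""
    return (1.0 - k) * N / j
```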

The most popular objection to applying shotgun sequencing to the human genome seems to come from the misunderstanding that the end-closing process is, or ought to be, also a stochastic one. What is theoretically clear is that if the entire process of whole genome sequencing is carried out as a stochastic process, an increase of genome size (*N*) leads to tremendous *redundancy* before the completion of sequencing, as partly suggested in Figure 3 (rightward-increasing curves for k = 1). (Plainly, if the probability for a genome to be completely sequenced within *i* trials is *p* (0 <= *p* <= 1), then the probability for a genome of *m*-fold size to be completed within *m*×*i* trials will be *p*^{m}, which is <<1 for large *m* whenever *p* < 1.)

The

Thus, if the cost for a directed process (

The relatively high

The other problems to be encountered in genome sequencing such as a problem of repeated sequences, which are not specific to whole genome shotgun sequencing but common to all strategies [4, 6], will be discussed elsewhere. This paper is intended to pursue a scientific and technological solution to large genome sequencing at large (not only to the present human genome sequencing) and not intended to side with a particular scientific policy.

This research was supported by Grant-in-Aid (#09272203) from the Ministry of Education, Science, Sports and Culture of Japan.

[2] Marshall, E.,

[3] Waterston, R. and Sulston, J.E.,

[4] Green, P.,

[5] Lander, E.S. and Waterman, M.S.,

[6] Weber, J.L. and Myers, E.W.,

[7] Nishigaki, K., Kinoshita, Y., Sakuma, Y., Nakabayashi, Y., Kyono, H., and Husimi, Y.,

[8] Fraser, C.M. et al.,

[9] Fleischmann, R.D. et al.,

[10] Bult, C.J.,

[11] Tomb, J-F.,

[12] Klenk, H-P.,

[13] Fraser, C.M.,

[14] Fraser, C.M.,

[15] Smith, D.R.,

[16] A paper dealing with this method will soon be published in a widely circulated journal (one possibility is

[17] The tentative cost estimation ($600 million) fell on an amount nearly equivalent to the three-year budget of the National Human Genome Research Institute (NHGRI) [2] and less than half of the original estimate (Watson, J.D.,
