Taikichiro Mori Memorial Research Fund, Research Report (2005) - Graduate School of Media and Governance, Shigeo Fujimori

Statistical Characterization of Transcription Start Sites in Plant Genomes

- Feature extraction for transcription start sites in plants and development of a TSS prediction method -
Shigeo Fujimori
Institute for Advanced Biosciences, Keio University
Graduate School of Media and Governance, Keio University
fujimori@sfc.keio.ac.jp washy@sfc.keio.ac.jp

1 Introduction

Although large amounts of genomic and full-length cDNA sequence data from plants are now publicly available, knowledge of promoters and transcription start sites (TSSs) in plants is still limited compared with mammals such as human and mouse. A recent paper reported a prominent GC-compositional strand bias, or GC-skew [=(C-G)/(C+G)], where C and G denote the numbers of cytosine and guanine residues, near the transcription start sites in Arabidopsis thaliana [6]. However, it is unclear whether other eukaryotic species have equally prominent GC-skews, and the biological meaning of this trait remains unknown. In this study, we conducted a comparative analysis using sequences from various eukaryotic genomes (animals, fungi, protists, and plants) to statistically characterize the TSSs of plant genes. In addition, we explored the potential value of GC-skew as an index for TSS prediction in plant genomes, where CpG islands and genes are not well correlated.

2 Materials and Methods

We used full-length cDNAs for Arabidopsis thaliana, Oryza sativa (rice), Homo sapiens, and Drosophila melanogaster, together with the genomic sequences of those species. TSS positions were determined by mapping the full-length cDNAs to the corresponding genomic sequences. In addition, genomic sequences for six fungi were used to investigate the regions around the translation initiation sites (TISs). To assess the shift in the GC-skew value around the TSSs (or TISs), we calculated GC-skew values for the regions from 1.0 kb upstream to 0.5 kb downstream of the TSSs (or TISs). A sliding window technique was used, in which the GC-skew value of each 100-bp window was assigned to the center position of that window. The GC-skew values at each position were then averaged over all genes.
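The sliding-window calculation described above can be sketched as follows. This is a minimal illustration, not the code used in the study. It assumes that each region has already been extracted on the sense strand with the same length and the same TSS (or TIS) offset, that the window advances one base at a time (the step size is not stated in the text), and that windows containing no C or G are skipped when averaging. The function names `gc_skew`, `skew_profile`, and `mean_profile` are hypothetical.

```python
from typing import List, Optional


def gc_skew(seq: str) -> Optional[float]:
    """GC-skew = (C - G) / (C + G); None if the sequence has no C or G."""
    s = seq.upper()
    c = s.count("C")
    g = s.count("G")
    if c + g == 0:
        return None
    return (c - g) / (c + g)


def skew_profile(region: str, window: int = 100) -> List[Optional[float]]:
    """GC-skew of every window; entry i belongs to centre position i + window // 2.

    Assumption: the window slides in 1-bp steps.
    """
    return [gc_skew(region[i:i + window])
            for i in range(len(region) - window + 1)]


def mean_profile(profiles: List[List[Optional[float]]]) -> List[Optional[float]]:
    """Average the profiles position by position, ignoring undefined windows."""
    n_pos = min(len(p) for p in profiles)
    means: List[Optional[float]] = []
    for i in range(n_pos):
        vals = [p[i] for p in profiles if p[i] is not None]
        means.append(sum(vals) / len(vals) if vals else None)
    return means


if __name__ == "__main__":
    # Toy regions standing in for the 1.5-kb windows around each TSS.
    regions = ["CCCG" * 50, "CCGG" * 50]
    averaged = mean_profile([skew_profile(r) for r in regions])
    print(len(averaged), averaged[0])
```

Keeping undefined windows as `None` rather than 0 avoids pulling the average toward zero in AT-only stretches; whether the original analysis handled such windows this way is not stated.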

3 Results and Discussion

Our study confirmed a significant GC-skew (C > G) around the TSSs of rice genes. The full-length cDNAs and genomic sequences from Arabidopsis and rice were compared using statistical analyses. Intriguingly, despite marked differences in the G+C content around the TSSs in the two plants, the degrees of bias were almost identical (Fig. 1). Although slight GC-skew peaks, including opposite skews (C < G), were detected around the TSSs of genes in human and Drosophila, they were qualitatively and quantitatively different from those identified in plants.


Figure 1: GC-skew in up- and downstream regions of the TSSs.

However, a plant-like GC-skew was identified in regions upstream of the TISs in some fungi, following analyses of expressed sequence tags (ESTs) and/or genomic sequences from other species. On the basis of our dataset, we estimated that >70% and 68% of Arabidopsis and rice genes, respectively, had a strong GC-skew (>0.33) in a 100-bp window (that is, the number of C residues was more than double the number of G residues in a +/-100-bp window around the TSS). The mean GC-skew value at the TSSs of highly expressed genes in Arabidopsis was significantly greater (P=0.0003, t-test) than that of genes with low expression levels. We therefore propose that the GC-skew around the TSSs in some plants and fungi is related to transcription. It might be caused by mutations during transcription initiation or by the frequent use of transcription factor-binding sites that have a strand preference.

It has been reported that the CpG island [3] is the most effective index for predicting promoter regions or TSSs in mammals [1,4]. However, CpG islands are not specifically located in the promoter regions in Arabidopsis, so they cannot be used for the prediction of TSSs or promoters [5]. Identifying another, more suitable index for the prediction of plant-specific TSSs has therefore become a priority. Many of the GC-skew peaks were preferentially located near the TSSs, so we examined the potential value of GC-skew as an index for TSS identification. Our results confirmed that the GC-skew can be used to assist TSS prediction in plant genomes (see Fig. 2 for an example).

In a previous study, we described a novel plant-specific feature: a gradient in microsatellite density along the direction of transcription [2] (Fig. 3). That study also revealed that some types of microsatellites occur especially in the 5'-UTRs of plant genes.
Because TSS prediction based only on GC-skew might not be sufficiently accurate, the combined use of the GC-skew and other appropriate indices, such as microsatellite density, appears to be a realistic and effective approach for improving prediction accuracy. We are therefore now exploring TSS prediction with a probabilistic model that includes multiple indices in addition to GC-skew.
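As a rough illustration of how a GC-skew peak could serve as a single index for locating a candidate TSS, the sketch below scans a region and reports the centre of the window with the strongest skew above the 0.33 threshold quoted above. This is not the prediction method of this study, which is described only as a probabilistic model still under development. The function name `strongest_skew_window`, the 1-bp step, and the rule of keeping the first maximum are all assumptions made for the example.

```python
from typing import Optional, Tuple


def gc_skew(seq: str) -> Optional[float]:
    """GC-skew = (C - G) / (C + G); None if the sequence has no C or G."""
    s = seq.upper()
    c = s.count("C")
    g = s.count("G")
    if c + g == 0:
        return None
    return (c - g) / (c + g)


def strongest_skew_window(region: str,
                          window: int = 100,
                          threshold: float = 0.33
                          ) -> Optional[Tuple[int, float]]:
    """Return (centre index, skew) of the window with the highest GC-skew.

    Only windows whose skew exceeds `threshold` are considered; a skew
    above 1/3 is equivalent to C > 2G, since (C-G)/(C+G) > 1/3 <=> C > 2G.
    Returns None when no window passes. Ties keep the first window found.
    """
    best: Optional[Tuple[int, float]] = None
    for i in range(len(region) - window + 1):
        skew = gc_skew(region[i:i + window])
        if skew is None or skew <= threshold:
            continue
        if best is None or skew > best[1]:
            best = (i + window // 2, skew)
    return best


if __name__ == "__main__":
    # Toy region: balanced background with a C-rich stretch in the middle.
    region = "ACGT" * 50 + "C" * 100 + "ACGT" * 50
    print(strongest_skew_window(region))
```

Windows with very few C and G residues can give extreme skew values from small counts, so a practical scan would probably also require a minimum C+G count per window or combine the skew with other indices, as proposed above.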


Figure 3: The microsatellite density in transcribed regions and their up-/downstream regions in (a) rice and (b) Arabidopsis.


Figure 2: An example case of GC-skew peak located near the TSS of plant gene. Solid triangle represents the predicted TSS position.

References

[1] Bajic V.B., Tan S.L., Suzuki Y. and Sugano S., Promoter prediction analysis on the whole human genome, Nat Biotechnol, 22:1467-1473, 2004.
[2] Fujimori S., Washio T., Higo K., et al., A novel feature of microsatellites in plants: a distribution gradient along the direction of transcription, FEBS Lett, 554(1-2):17-22, 2003.
[3] Gardiner-Garden M. and Frommer M., CpG islands in vertebrate genomes, J Mol Biol, 196:261-82, 1987.
[4] Hannenhalli S. and Levy S., Promoter prediction in the human genome, Bioinformatics, 17 Suppl 1:S90-6, 2001.
[5] Rombauts S., Florquin K., Lescot M., et al., Computational approaches to identify promoters and cis-regulatory elements in plant genomes, Plant Physiol, 132:1162-76, 2003.
[6] Tatarinova T., Brover V., Troukhan M. and Alexandrov N., Skew in CG content near the transcription start site in Arabidopsis thaliana, Bioinformatics, 19 Suppl 1:I313-I314, 2003.