Taikichiro Mori Memorial Research Grants: Activity Report, Academic Year 2006

 

 

Construction of In Vivo Molecular Networks from Large-Scale Genomic Data

 

 

Keio University, Graduate School of Media and Governance, Master's Program, 2nd Year

Nozomu Yachie (谷内江 望)

nzm at sfc.keio.ac.jp

 

 

 

1) Alignment-Based Approach for Durable Data Storage into Living Organisms.

 

Yachie N, Sekiyama K, Sugahara J, Ohashi Y, Tomita M.

 

The practical realization of DNA data storage is a major scientific goal. Here we introduce a simple, flexible, and robust data storage and retrieval method based on sequence alignment of the genomic DNA of living organisms. Copies of the same data, encoded by different oligonucleotide sequences, were inserted redundantly into multiple loci of the Bacillus subtilis genome. Multiple alignment of the bit data sequences decoded from the B. subtilis genome sequences enabled the retrieval of stable and compact data without the need for template DNA, parity checks, or error-correcting algorithms. We discuss the practical use of this alignment-based method in combination with a computational simulation of data retrieval from mutated message DNA.

 

*Biotechnology Progress, in press
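
The retrieval idea in the abstract above can be illustrated with a minimal sketch. It assumes a hypothetical 2-bit-per-base code, substitution-only mutations, and copies that are already aligned column by column; the encodings, loci, and alignment procedure of the actual study are not reproduced here.

```python
from collections import Counter

# Hypothetical 2-bit-per-base code (an assumption, not the paper's encoding).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(bits):
    """Encode an even-length bit string as DNA, two bits per base."""
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(dna):
    """Decode a DNA string back into a bit string."""
    return "".join(BASE_TO_BITS[base] for base in dna)


def consensus(bit_strings):
    """Column-wise majority vote over equal-length, pre-aligned bit strings."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*bit_strings))


message = "0110100001101001"
dna = encode(message)

# Three redundant copies; two of them carry one substitution each.
copies = [dna, "T" + dna[1:], dna[:4] + "A" + dna[5:]]
recovered = consensus([decode(copy) for copy in copies])
print(recovered == message)
```

With an odd number of copies and mutations at different positions, the majority vote restores the message without any parity or error-correcting code. Insertions and deletions would require a real multiple-alignment step, which this sketch omits.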

 

 

 

2) In silico screening of archaeal tRNA-encoding genes having multiple introns with bulge-helix-bulge splicing motifs.

 

Sugahara J, Yachie N, Arakawa K, Tomita M.

 

In archaeal species, several transfer RNA genes have been reported to contain endogenous introns. Although most of the introns are located in the anticodon loop region between nucleotide positions 37 and 38, a number of introns at noncanonical sites and six cases of tRNA genes containing two introns have also been documented. However, these tRNA genes are often missed by tRNAscan-SE, the software most widely used for the annotation of tRNA genes. We previously developed SPLITS, a computational tool to identify tRNA genes containing one intron at a noncanonical position on the basis of its discriminative splicing motif, but the software was limited in the detection of tRNA genes with multiple introns at noncanonical sites. In this study, we first updated the system as SPLITSX in order to correctly predict known tRNA genes as well as novel ones with multiple introns. Through a comprehensive search for tRNA genes in 29 archaeal genomes using SPLITSX, we listed 43 novel candidates that contain introns at noncanonical sites. Of these, 15 contained two introns and three contained three introns within the respective putative tRNA genes. Moreover, in two archaeal species, the uncultured methanogenic archaeon RC-I and Thermofilum pendens Hrk 5, the novel candidates, which were not detectable by tRNAscan-SE alone, supplied the tRNA genes needed to cover all codons.

 

*RNA, in press

 

 

 

3) eXpanda: an Integrated Platform for Network Analysis and Visualization.

 

Negishi Y, Nakamura H, Yachie N, Saito R, Tomita M.

 

Analysis and visualization of biological networks, such as protein-protein and protein-DNA interactions, are crucially important for obtaining a thorough understanding of living systems. Here, we present an integrative software platform, eXpanda, which enables the analysis of a very broad range of biological networks, with a special focus on the extraction of characteristic topologies that potentially function as units in the networks. eXpanda is provided as a Perl library that connects fully automatically to various biological databases via a programmable Perl interface and can perform topological analysis based on graph theory. The results of these analyses can be visualized as vector graphics. eXpanda is distributed under the GNU General Public License. The software package, detailed documentation, source code, and some sample scripts can be downloaded at http://medcd.iab.keio.ac.jp/expanda/.

 

*In Silico Biology; 7: 0013
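
As an illustration of the kind of graph-theoretic topological analysis mentioned in the abstract above, the following sketch computes node degree and the local clustering coefficient on a toy interaction network. It is written in plain Python and does not use or imitate the eXpanda Perl interface; the function names and the example edges are hypothetical.

```python
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def clustering_coefficient(adj, node):
    """Fraction of neighbor pairs of `node` that are themselves connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


# Toy protein-protein interaction network (hypothetical data).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
adj = build_graph(edges)

for node in sorted(adj):
    print(node, len(adj[node]), round(clustering_coefficient(adj, node), 3))
```

Nodes with high degree and low clustering (such as C here) are typical hub candidates, whereas tightly clustered nodes point to densely connected modules.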

 

 

 

4) On the interplay of gene positioning and the role of rho-independent terminators in Escherichia coli.

 

Yachie N, Arakawa K, Tomita M.

 

The majority of intrinsic rho-independent terminator signals, reported to consist of stable hairpin structures followed by T-rich regions, possess the potential to operate bi-directionally and to induce transcription terminations on both strands of the DNA duplex in Escherichia coli. By using RNAMotif software, we investigated the distributions of termination motifs around the 3'-ends of overlapping and non-overlapping genes at the genomic level. We suggest that the positions of compactly encoded E. coli genes and rho-independent terminators are optimized to terminate the adjoining genes on their antisense strands efficiently, and not to mis-terminate overlapping transcripts, due to their bi-directional properties.

 

*FEBS Letters; 580(30): 6909-6914
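
The bi-directional property described in the abstract above follows from the motif's symmetry: a hairpin flanked by an A-rich stretch upstream and a T-rich stretch downstream reads as "hairpin followed by T-rich region" on both strands. The sketch below is a deliberately simple scanner with perfect stems and fixed thresholds; all parameters are assumptions, and it does not reproduce the RNAMotif descriptor used in the study.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_terminators(seq, stem=6, min_loop=3, max_loop=8, tail=8, min_t=5):
    """Return (start, end) spans of a perfect hairpin followed by a T-rich tail.

    All thresholds are illustrative assumptions.
    """
    hits = []
    for i in range(len(seq)):
        left = seq[i:i + stem]
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop          # start of the right stem arm
            end = j + stem + tail        # end of the T-rich tail
            if end > len(seq):
                break
            right = seq[j:j + stem]
            tail_seq = seq[j + stem:end]
            if right == revcomp(left) and tail_seq.count("T") >= min_t:
                hits.append((i, end))
    return hits


# Synthetic example: A-rich flank, hairpin, T-rich flank.
seq = "AAAAAAAA" + "GCCGCC" + "TTCG" + "GGCGGC" + "TTTTTTTT"
print(find_terminators(seq))           # scan the given strand
print(find_terminators(revcomp(seq)))  # scan the opposite strand
```

Both scans report the same span, which is the symmetry that allows one terminator to stop transcription of convergent genes on either strand.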

 

 

 

5) HybGFS: a hybrid method for genome-fingerprint scanning.

 

Shinoda K, Yachie N, Masuda T, Sugiyama N, Sugimoto M, Soga T, Tomita M.

 

BACKGROUND: Protein identification based on mass spectrometry (MS) has previously been performed using peptide mass fingerprinting (PMF) or tandem MS (MS/MS) database searching. However, these methods cannot identify proteins that are not already listed in existing databases. Moreover, the alternative approach of de novo sequencing requires costly equipment and the interpretation of complex MS/MS spectra. Thus, there is a need for novel high-throughput protein-identification methods that are independent of existing predefined protein databases.

RESULTS: Here, we present a hybrid method for genome-fingerprint scanning, known as HybGFS. This technique combines genome sequence-based peptide MS/MS ion searching with liquid-chromatography elution-time (LC-ET) prediction, to improve the reliability of identification. The hybrid method allows the simultaneous identification and mapping of proteins without a priori information about their coding sequences. The current study used standard LC-MS/MS data to query an in silico-generated six-reading-frame translation and the enzymatic digest of an entire genome. Used in conjunction with precursor/product ion-mass searching, the LC-ETs increased confidence in the peptide-identification process and reduced the number of false-positive matches. The power of this method was demonstrated using recombinant proteins from the Escherichia coli K12 strain.

CONCLUSION: The novel hybrid method described in this study will be useful for the large-scale experimental confirmation of genome coding sequences, without the need for transcriptome-level expression analysis or costly MS database searching.

 

*BMC Bioinformatics; 7: 479
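
The in silico six-reading-frame translation and enzymatic digest mentioned in the abstract above can be sketched as follows. The standard genetic code is used, and the cleavage rule (after K or R unless followed by P, i.e. trypsin-like) is an assumption chosen for illustration; the abstract does not name the enzyme, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translations of the three forward and three reverse-strand frames."""
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[frame:]) for strand in (dna, reverse)
            for frame in range(3)]


def digest(protein):
    """Cleave after K or R unless the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        following = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and following != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


dna = "ATGGCTAAACCGAGATGGTAA"
for frame_index, frame in enumerate(six_frames(dna)):
    for segment in frame.split("*"):   # stop codons delimit candidate ORFs
        for peptide in digest(segment):
            print(frame_index, peptide)
```

In a genome-fingerprint search, each peptide would additionally be annotated with its genomic coordinates and theoretical mass so that observed precursor and product ions can be mapped back to a locus.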

 

 

 

6) Prediction of liquid chromatographic retention times of peptides generated by protease digestion of the Escherichia coli proteome using artificial neural networks.

 

Shinoda K, Sugimoto M, Yachie N, Sugiyama N, Masuda T, Robert M, Soga T, Tomita M.

 

We developed a computational method to predict the retention times of peptides in HPLC using artificial neural networks (ANN). We performed stepwise multiple linear regressions and selected, as ANN inputs, the amino acids that significantly affected the LC retention time. Unlike conventional linear models, the trained ANN accurately predicted the retention time of peptides containing up to 50 amino acid residues. For 834 peptides, there was a strong correlation (R^2 = 0.928) between measured and predicted retention times. We demonstrated the utility of our method by predicting the retention times of 121,273 peptides resulting from LysC digestion of the Escherichia coli proteome. Our approach is useful for the proteome-wide characterization of peptides and the identification of unknown peptide peaks obtained in proteome analysis.

 

*Journal of Proteome Research; 5(12):3312-3317
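
Two small building blocks of such an evaluation can be sketched as follows: an amino acid composition vector of the kind that could serve as model input, and the squared correlation between measured and predicted retention times. Reading R^2 as the squared Pearson correlation is an assumption, the example numbers are invented, and the neural network itself is not reproduced.

```python
RESIDUES = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(peptide):
    """Count of each of the 20 standard residues, in RESIDUES order."""
    return [peptide.count(residue) for residue in RESIDUES]


def r_squared(measured, predicted):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    sxy = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    sxx = sum((m - mean_m) ** 2 for m in measured)
    syy = sum((p - mean_p) ** 2 for p in predicted)
    return sxy * sxy / (sxx * syy)


# Invented retention times in minutes, for illustration only.
measured = [12.1, 18.4, 25.0, 31.7, 40.2]
predicted = [13.0, 17.5, 26.1, 30.2, 41.0]
print(composition_features("AAK"))
print(r_squared(measured, predicted))
```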

 

 

 

7) SPLITS: a new program for predicting split and intron-containing tRNA genes at the genome level.

 

Sugahara J, Yachie N, Sekine Y, Soma A, Matsui M, Tomita M, Kanai A.

 

In the archaea, some tRNA precursors contain intron(s) not only in the anticodon loop region but also in diverse sites of the gene (intron-containing tRNA or cis-spliced tRNA). The parasite Nanoarchaeum equitans, a member of the Nanoarchaeota kingdom, creates functional tRNA from separate genes, one encoding the 5'-half and the other the 3'-half (split tRNA or trans-spliced tRNA). Although recent genome projects have revealed a huge amount of nucleotide sequence data in the archaea, a comprehensive methodology for intron-containing and split tRNA searching is yet to be established. We therefore developed SPLITS, which is aimed at searching for any type of tRNA gene and is especially focused on intron-containing tRNAs or split tRNAs at the genome level. SPLITS initially predicts the bulge-helix-bulge splicing motif (a well-known, required structure in archaeal pre-tRNA introns) to determine and remove the intronic regions of tRNA genes. The intron-removed DNA sequences are automatically queried to tRNAscan-SE. SPLITS can predict known tRNAs with single introns located at unconventional sites on the genes (100%), tRNAs with double introns (85.7%), and known split tRNAs (100%). Our program will be very useful for identifying novel tRNA genes after completion of genome projects. The SPLITS source code is freely downloadable at http://splits.iab.keio.ac.jp/.

 

*In Silico Biology; 6: 0039
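
The intron-removal step described in the abstract above, which precedes the tRNAscan-SE query, can be sketched as a small helper that excises one or more predicted intron intervals and rejoins the exons. The coordinates are assumed to come from an upstream bulge-helix-bulge motif predictor, which is not implemented here; the function name and the toy sequence are hypothetical.

```python
def excise_introns(seq, introns):
    """Remove 0-based, half-open (start, end) intervals and join the exons."""
    exons, position = [], 0
    for start, end in sorted(introns):
        if start < position or end > len(seq) or start >= end:
            raise ValueError("invalid or overlapping intron interval")
        exons.append(seq[position:start])
        position = end
    exons.append(seq[position:])
    return "".join(exons)


# Toy sequence: lowercase marks two hypothetical introns.
pre_trna = "AAAAgggCCCCttTTTT"
mature = excise_introns(pre_trna, [(4, 7), (11, 13)])
print(mature)
```

Processing the intervals in sorted order on the original coordinates avoids the index shifts that arise when introns are deleted one at a time, which matters for genes with two or three introns.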

 

 

 

8) Prediction of non-coding and antisense RNA genes in Escherichia coli with Gapped Markov Model.

 

Yachie N, Numata K, Saito R, Kanai A, Tomita M.

 

A new mathematical index was developed to identify and characterize non-coding RNA (ncRNA) genes encoded within the Escherichia coli (E. coli) genome. It was designated the GMMI (Gapped Markov Model Index) and used to evaluate sequence patterns located at the separate positions of consensus sequences, codon biases and/or possible RNA structures on the basis of the Markov model. The GMMI was able to separate a set of known mRNA sequences from a mixture of ncRNAs including tRNAs and rRNAs. Consequently, the GMMI was employed to predict novel ncRNA candidates. First, possible transcription units were extracted from the E. coli genome using consensus sequences for the sigma70 promoter and the rho-independent terminator. Then, these units were evaluated using the GMMI. This identified 133 candidate ncRNAs, which included 29 previously annotated small RNA genes and 46 possible antisense ncRNAs. Furthermore, 12 transcripts (including five antisense RNAs) were confirmed by expression analysis. These data suggest that the expression of small antisense RNAs might be more common than previously thought in the E. coli genome.

 

*Gene; 372: 171-181
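
The general idea of scoring sequence patterns between separated positions can be illustrated with a gapped-pair log-odds score. This is only a loosely related sketch: the abstract does not give the GMMI formula, so the scoring function, the pseudocount, and the training sequences below are all assumptions and should not be read as the published index.

```python
import math
from collections import Counter


def gapped_pair_counts(seqs, gap):
    """Count nucleotide pairs (x[i], x[i + gap]) over a set of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - gap):
            counts[(seq[i], seq[i + gap])] += 1
    return counts


def log_odds_score(seq, foreground, background, gap, pseudo=1.0):
    """Mean log-odds of gapped pairs under foreground vs. background counts."""
    fg_total = sum(foreground.values()) + 16 * pseudo
    bg_total = sum(background.values()) + 16 * pseudo
    n_pairs = len(seq) - gap
    if n_pairs <= 0:
        return 0.0
    score = 0.0
    for i in range(n_pairs):
        pair = (seq[i], seq[i + gap])
        p = (foreground[pair] + pseudo) / fg_total
        q = (background[pair] + pseudo) / bg_total
        score += math.log(p / q)
    return score / n_pairs


# Invented training sets, for illustration only.
foreground = gapped_pair_counts(["ACACACAC"], gap=1)
background = gapped_pair_counts(["AAAAAAAA", "CCCCCCCC"], gap=1)
print(log_odds_score("ACACAC", foreground, background, gap=1))
print(log_odds_score("AAAAAA", foreground, background, gap=1))
```

Varying `gap` lets the same score capture period-3 codon bias (gap of 3) or longer-range dependencies, which is the motivation for using gapped rather than adjacent positions.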

 

 

 

9) Computational analysis of microRNA targets in Caenorhabditis elegans.

 

Watanabe Y, Yachie N, Numata K, Saito R, Kanai A, Tomita M.

 

MicroRNAs (miRNAs) are endogenous, approximately 22-nucleotide (nt) non-coding RNAs that post-transcriptionally regulate the expression of target genes via hybridization to target mRNA. Using known pairs of miRNA and target mRNA in Caenorhabditis elegans, we first performed computational analysis for specific hybridization patterns between these two RNAs. We counted the numbers of perfectly complementary dinucleotide sequences and calculated the free energy within complementary base pairs of each dinucleotide, observed by sliding a 2-nt window along all nucleotides of the miRNA-mRNA duplex. We confirmed not only strong base pairing within the 5' region of miRNAs (nts 1-8) in C. elegans, but also the required mismatch within the central region (nt 9 or nt 10), and we found weak binding within the 3' region (nts 13-14). We also predicted 687 possible miRNA target transcripts, many of which are thought to be involved in C. elegans development, by combining the above-mentioned hybridization tendency with the following analyses: (1) prediction of the miRNA-mRNA duplex with free-energy minimization; (2) identification of the complementary pattern within the miRNA-mRNA duplex; (3) conservation of target sites between C. elegans and C. briggsae, a related soil nematode; and (4) extraction of mRNA candidates with multiple target sites. Rigorous tests using shuffled miRNA controls supported these predictions. Our results suggest that miRNAs recognize their target mRNAs by their hybridization pattern and that many target mRNAs may be regulated through a combination of several specific miRNA target sites in C. elegans.

 

*Gene; 365: 2-10
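
The 2-nt sliding-window count described in the abstract above can be sketched as follows. The sketch assumes ungapped duplexes in which the target site is written 3' to 5' so that position i of the miRNA faces position i of the target, and it counts Watson-Crick pairs only; the free-energy calculation and the handling of bulges are omitted, and the example sequences are illustrative.

```python
# Watson-Crick pairs only; G:U wobble is treated as a mismatch (assumption).
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def dinucleotide_matches(mirna, target_3to5):
    """For each 2-nt window, True if both positions are perfectly paired."""
    paired = [(m, t) in WATSON_CRICK for m, t in zip(mirna, target_3to5)]
    return [paired[i] and paired[i + 1] for i in range(len(paired) - 1)]


def position_profile(duplexes):
    """Fraction of duplexes with a perfectly paired dinucleotide per window."""
    rows = [dinucleotide_matches(m, t) for m, t in duplexes]
    width = min(len(row) for row in rows)
    return [sum(row[i] for row in rows) / len(rows) for i in range(width)]


# Two illustrative duplexes over the first 8 nt of a miRNA.
duplexes = [
    ("UGAGGUAG", "ACUCCAUC"),  # fully complementary
    ("UGAGGUAG", "ACUGCAUC"),  # mismatch at miRNA position 4
]
print(position_profile(duplexes))
```

Applied to many known miRNA-target pairs, a profile of this kind shows which window positions are consistently paired, such as the 5' region, and which tolerate or require mismatches.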

 

 

 

Copyright © 2007, Nozomu Yachie, Institute for Advanced Biosciences, Keio University