Achievements report for Graduate Student Researcher Development Grant by Taikichiro Mori Memorial Research Fund for the academic year 2010

Student name: Mohamed Helmy Mahmoud Attia Shehata

Student number: 80725860

Student grade: Ph.D., 2nd grade.

Affiliation: Graduate School of Media and Governance

Research project: Rice Proteogenomics project

Introduction

The Graduate Student Researcher Development Grant from the Taikichiro Mori Memorial Research Fund allowed me to improve my research environment and, as a result, to achieve better results in my research. During the grant period I wrote three research articles: two have already been submitted and one is in final preparation. In addition, I contributed to two international conferences. The grant helped me upgrade my workspace with new hardware (a Mac mini and a monitor) and set up a mobile workstation by purchasing an Apple iPad and some peripherals. I also used the grant to cover part of the travel expenses for the conference and the conference registration fees.

Research Articles

1- Title: OryzaPG-DB: Rice Proteome Database based on Shotgun Proteogenomics

Journal: BMC Plant Biology

Status: Revised version submitted

Abstract: Background: Proteogenomics aims to utilize experimental proteome information for refinement of genome annotation. Since mass spectrometry-based shotgun proteomics approaches provide large-scale peptide sequencing data with high throughput, a data repository for shotgun proteogenomics would represent a valuable source of gene expression evidence at the translational level for genome re-annotation. Description: Here, we present OryzaPG-DB, a rice proteome database based on shotgun proteogenomics, which incorporates the genomic features of experimental shotgun proteomics data. This version of the database was created from the results of 27 nanoLC-MS/MS runs on a hybrid ion trap-orbitrap mass spectrometer, which offers high accuracy for analyzing tryptic digests from undifferentiated cultured rice cells. Peptides were identified by searching the product ion spectra against the protein, cDNA, transcript and genome databases from Michigan State University, and were mapped to the rice genome. Approximately 3200 genes were covered by these peptides and 40 of them contained novel genomic features. Users can search, download or navigate the database per chromosome, gene, protein, cDNA or transcript and download the updated annotations in standard GFF3 format, with visualization in PNG format. In addition, the database scheme of OryzaPG was designed to be generic and can be reused to host similar proteogenomic information for other species. OryzaPG is the first proteogenomics-based database of the rice proteome, providing peptide-based expression profiles, together with the corresponding genomic origin, including the annotation of novelty for each peptide. Conclusions: The OryzaPG database was constructed and is freely available at http://oryzapg.iab.keio.ac.jp/.
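As a rough illustration of the GFF3 export mentioned in the abstract, the sketch below formats one peptide-to-genome mapping as a nine-column GFF3 line. The function name, source label, feature type, attribute keys, coordinates and peptide sequence are all hypothetical and are not taken from OryzaPG-DB.

```python
def gff3_line(seqid, start, end, strand, peptide_id, sequence,
              source="OryzaPG_example", feature_type="peptide_match"):
    """Format one mapped peptide as a tab-separated, nine-column GFF3 record.

    GFF3 coordinates are 1-based and inclusive. The source, feature type and
    attribute keys used here are illustrative assumptions.
    """
    if start > end:
        raise ValueError("GFF3 requires start <= end")
    attributes = f"ID={peptide_id};Name={sequence}"
    # Columns: seqid, source, type, start, end, score, strand, phase, attributes
    return "\t".join([seqid, source, feature_type, str(start), str(end),
                      ".", strand, ".", attributes])


# Hypothetical 11-residue peptide covering 33 nucleotides on the forward strand.
example = gff3_line("Chr1", 1200, 1232, "+", "pep0001", "LSGTAVEEAIK")
print("##gff-version 3")
print(example)
```

The score and phase columns are left as "." because this toy record carries no score and is not a CDS feature.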

2- Title: Peptide identification by searching large-scale tandem mass spectra against large databases: bioinformatics methods in proteogenomics

Journal: Journal of genes, genome and genomics

Status: Submitted

Abstract: Mass spectrometry-based shotgun proteomics approaches are currently considered the technology of choice for large-scale proteogenomics because of their high throughput, good availability and relative ease of use. Protein mixtures are first digested with a protease, e.g. trypsin, and the resultant peptides are analyzed using liquid chromatography-tandem mass spectrometry. Proteins and peptides are identified from the resultant tandem mass spectra by de novo interpretation of the spectra or by searching databases of putative sequences. Since these data represent the expressed proteins in the sample, they can be used to infer novel proteogenomic features when mapped to the genome. However, high-throughput mass spectrometry instruments can readily generate hundreds of thousands, perhaps millions, of spectra, and the size of genomic databases, such as six-frame translated genome databases, is enormous. Therefore, computational demands are very high, and there is potential inaccuracy in peptide identification due to the large search space. These issues are considered the main challenges that limit the utilization of this approach. In this review, we highlight the efforts of the proteomics and bioinformatics communities to develop methods, algorithms and software tools that facilitate peptide sequence identification from databases in large-scale proteogenomic studies.
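To make the size of a six-frame translated database more concrete, here is a minimal sketch of six-frame translation using the standard genetic code. The function names and the short input sequence are illustrative assumptions. Real pipelines would also split each frame at stop codons and digest it in silico, which is omitted here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Translate three forward and three reverse-complement reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Each nucleotide contributes to a codon in roughly all six frames, so the translated database holds about twice as many residues as the genome has bases. This is one reason the search space grows so quickly.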

3- Title: Mass Spectrum Sequential Subtraction: bioinformatics method facilitates searching large dataset of peptide MS/MS spectra against large nucleotide databases for proteogenomics

Status: Under preparation

Abstract: We developed MSSS (Mass Spectrum Sequential Subtraction), a novel bioinformatics method for comparing the large datasets of peptide spectra produced by liquid chromatography-tandem mass spectrometry (LC-MS/MS) against a series of protein and large nucleotide sequence databases in order to find novel genomic features. The main principle of MSSS is to search the set of peptide spectra against the protein database, remove the spectra corresponding to the identified peptides, and then search the remaining spectra against the nucleotide sequence databases. We therefore reduce the number of spectra to be searched, in contrast to the database reduction approaches that other methods use to limit the peptide search space. We compared the search time, computational demands, accuracy and peptide identification capability of MSSS with those of the conventional search approach, which does not include the spectra subtraction step; MSSS reduced the number of search queries to 50% and the search time to 75% on average. Furthermore, MSSS did not affect the false positive rate (FPR) of the identification and showed a comparable ability to identify peptide sequences. We used MSSS to analyze our 27 LC-MS/MS runs of cultured rice cells against the rice protein, cDNA, transcript and genome databases, which resulted in the identification of 346 novel peptides that do not exist in any annotated protein. The identified peptides were used to perform a proteogenomic analysis of the rice genome annotation, which pointed to new genomic features in 89 genes. In addition, 30 frame shifts were observed and 112 peptides were mapped to intergenic regions, indicating the possible existence of novel, non-annotated genes. These results show the utility of MSSS in searching large databases with large-scale MS/MS datasets for proteogenomics.
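The subtraction principle described in the abstract can be sketched as follows. This is a toy illustration under strong assumptions, not the MSSS implementation. Each spectrum is represented by the peptide string it would match, and a hypothetical `search` function (a plain substring test) stands in for a real search engine. All databases are given as amino-acid strings, and the spectrum identifiers and sequences are invented.

```python
def search(spectra, database):
    """Toy stand-in for a search engine (assumption): a spectrum counts as
    identified when its peptide string occurs in any database sequence."""
    return {sid: pep for sid, pep in spectra.items()
            if any(pep in seq for seq in database)}


def sequential_subtraction(spectra, databases):
    """Search databases in order, removing identified spectra after each one.

    Returns the identifications (with the database that produced them), the
    spectra left unidentified, and the total number of search queries issued.
    """
    remaining = dict(spectra)
    identified = {}
    queries = 0
    for name, database in databases:
        queries += len(remaining)          # only the leftover spectra are searched
        hits = search(remaining, database)
        for sid, pep in hits.items():
            identified[sid] = (pep, name)
            del remaining[sid]             # subtraction step
    return identified, remaining, queries


spectra = {"s1": "PEPTIDEK", "s2": "LSGTAVEEAIK",
           "s3": "NVELPEPK", "s4": "AAAAGGGR"}
databases = [("protein", ["MMPEPTIDEKLSGTAVEEAIKQQ"]),
             ("genome", ["GGNVELPEPKTT"])]

identified, remaining, queries = sequential_subtraction(spectra, databases)
print(identified)
print(remaining)
print(queries, "queries, versus", len(spectra) * len(databases), "without subtraction")
```

In this invented example, two of the four spectra are explained by the protein database, so only the other two are sent to the larger database. The actual savings depend on how many spectra each stage identifies.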

International Conferences Contributions

1- The 9th Annual World Congress of the Human Proteome Organization (HUPO2010). Sydney, Australia.

Date and Venue: September 2010, Sydney Convention and Exhibition Center, Sydney, Australia.

Title: Onco-proteogenomics: toward the identification of oncogenic peptides and proteins.

Abstract: The accumulation of somatic mutations is a common property of all cancer genomes. These mutations include several patterns of mutagenesis, such as small insertions, chromosomal rearrangements and nucleotide substitutions. Consequently, the mutated genomes produce mutant proteins that give the cancer cell its oncogenic properties. For such mutated proteins, however, mass spectrometry-based identification by shotgun proteomics is generally difficult because the identification depends on databases containing normal proteins or on hybrid databases containing normal and mutated proteins. Here, we present 'onco-proteogenomics', a novel proteogenomics approach to identify cancer-related peptides and proteins using four databases containing normal sequences (human protein, cDNA, mRNA and genome databases) and one cancer-derived database (a cancer EST database). We applied our approach to MS/MS data obtained from 15 nanoLC-MS/MS runs of the HeLa S3 cell phosphoproteome. The human protein, cDNA, mRNA and genome databases were used for Mascot peptide identification. Following each identification, we subtracted all MS/MS spectra corresponding to the identified peptides in order to exclude all spectra matched to peptides expressed from normal sequences. Next, we constructed a HeLa S3 EST database by combining HeLa S3 ESTs generated in seven different studies. The constructed database contains over 60,000 entries. We then performed a Mascot search of the remaining unidentified MS/MS spectra against this EST database. Consequently, we were able to identify 25 oncogenic peptides, including phosphorylated sites. In future work, we will apply the same approach to different cancers, aiming to identify global cancer biomarkers and drug targets.

Poster (PDF)

2- Beyond the Genome: The true gene count, human evolution and disease genomics (BTG2010)

Date and Venue: October 2010, Harvard Medical School, Boston, MA, USA.

Title: Onco-proteogenomics: A novel approach for identification of cancer-specific mutation combining proteomics and transcriptome deep sequencing

Abstract: The accumulation of somatic mutations is a common property of all cancer genomes. These mutations include several patterns of mutagenesis, such as small insertions, chromosomal rearrangements and nucleotide substitutions. Consequently, the mutated genomes produce a mutant transcriptome and, therefore, mutant proteins that give the cancer cell its oncogenic properties [1]. For such mutated proteins, however, mass spectrometry-based identification by shotgun proteomics is generally difficult because the identification depends on databases containing normal proteins or on hybrid databases containing normal and mutated proteins. Here, we present 'onco-proteogenomics', a novel proteogenomics approach to identify cancer-related peptides (phospho- and non-phospho peptides) and proteins. We analyzed HeLa S3 cells as a test sample by shotgun phosphoproteomics, and the obtained data were analyzed by an extended version of MSRI (MS Spectra Reduction after Identification), the proteogenomic approach that we previously used to identify novel genomic features in the rice plant [2]. In onco-proteogenomics, we used four databases containing normal sequences (human protein, cDNA, mRNA and genome databases) for Mascot peptide identification and removed all the MS/MS spectra corresponding to the identified peptides. The remaining MS/MS spectra were searched against one cancer-derived database, obtained by deep sequencing of HeLa S3 cells, to identify cancer-specific peptides. The four databases containing normal sequences were used sequentially in the peptide identification to identify all potential peptide sequences that can be generated from the normal genome. These include the potential protein sequences, junction peptides and exon-skipping peptides (protein and cDNA databases), exonic peptides (mRNA database) and extragenic peptides (genome database). Following each Mascot database search, we removed all the MS/MS spectra corresponding to the identified peptide sequences and created new files containing the remaining MS/MS spectra. Next, we constructed a HeLa S3 transcriptome database with data obtained from deep sequencing of HeLa S3 cells (NCBI UniGene Database). The constructed database contains over 60,000 entries. We then performed a Mascot search of the remaining unidentified MS/MS spectra against this transcriptome database. Consequently, we were able to identify 25 cancer-specific peptides, including phosphorylated sites. As a further check, the identified peptides were aligned to the normal databases employed using NCBI BLAST. The alignment did not show any significant matches, indicating that these peptides are specifically expressed in the HeLa S3 cancer cell line. In future work, we will apply the same approach to different cancers, aiming to identify global cancer biomarkers and drug targets.

Poster (PDF-Online)