Achievements report for Graduate Student Researcher Development Grant by Taikichiro Mori Memorial Research Fund for the academic year 2010
Student name: Mohamed Helmy Mahmoud Attia Shehata
Student number: 80725860
Student grade: Ph.D., 2nd grade.
Affiliation: Graduate School of Media and Governance
Research project: Rice Proteogenomics project
Using the Graduate Student Researcher Development Grant by Taikichiro Mori Memorial Research Fund, I was able to improve my research environment and therefore achieve better results in my research. This resulted in three research articles: two have already been submitted and one is in final preparation. In addition, I contributed to two international conferences. The grant helped me upgrade my workspace with new hardware (a Mac mini and a monitor) and set up a mobile workstation by purchasing a new Apple iPad and some peripherals. Moreover, I used the grant to cover some of the travel expenses for the conference and to pay the conference registration fees.
1 -Title: OryzaPG-DB: Rice Proteome Database based on Shotgun Proteogenomics
Journal: BMC Plant Biology
Status: Revised version submitted
Abstract: Background: Proteogenomics aims to utilize experimental proteome information for refinement of genome annotation. Since mass spectrometry-based shotgun proteomics approaches provide large-scale peptide sequencing data with high throughput, a data repository for shotgun proteogenomics would represent a valuable source of gene expression evidence at the translational level for genome re-annotation. Description: Here, we present OryzaPG-DB, a rice proteome database based on shotgun proteogenomics, which incorporates the genomic features of experimental shotgun proteomics data. This version of the database was created from the results of 27 nanoLC-MS/MS runs on a hybrid ion trap-orbitrap mass spectrometer, which offers high accuracy for analyzing tryptic digests from undifferentiated cultured rice cells. Peptides were identified by searching the product ion spectra against the protein, cDNA, transcript and genome databases from Michigan State University, and were mapped to the rice genome. Approximately 3200 genes were covered by these peptides and 40 of them contained novel genomic features. Users can search, download or navigate the database per chromosome, gene, protein, cDNA or transcript and download the updated annotations in standard GFF3 format, with visualization in PNG format. In addition, the database scheme of OryzaPG was designed to be generic and can be reused to host similar proteogenomic information for other species. OryzaPG is the first proteogenomics-based database of the rice proteome, providing peptide-based expression profiles, together with the corresponding genomic origin, including the annotation of novelty for each peptide. Conclusions: The OryzaPG database was constructed and is freely available at http://oryzapg.iab.keio.ac.jp/.
2 -Title: Peptide identification by searching large-scale tandem mass spectra against large databases: bioinformatics methods in proteogenomics
Journal: Journal of genes, genome and genomics
Abstract: Mass spectrometry-based shotgun proteomics approaches are currently considered the technology of choice for large-scale proteogenomics due to their high throughput, good availability and relative ease of use. Protein mixtures are first digested with a protease, e.g. trypsin, and the resultant peptides are analyzed using liquid chromatography-tandem mass spectrometry. Proteins and peptides are identified from the resultant tandem mass spectra by de novo interpretation of the spectra or by searching databases of putative sequences. Since these data represent the expressed proteins in the sample, they can be used to infer novel proteogenomic features when mapped to the genome. However, high-throughput mass spectrometry instruments can readily generate hundreds of thousands, perhaps millions, of spectra, and the size of genomic databases, such as six-frame translated genome databases, is enormous. Therefore, computational demands are very high, and there is potential inaccuracy in peptide identification due to the large search space. These issues are considered the main challenges that limit the utilization of this approach. In this review, we highlight the efforts of the proteomics and bioinformatics communities to develop methods, algorithms and software tools that facilitate peptide sequence identification from databases in large-scale proteogenomic studies.
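As an illustration of why six-frame translated genome databases enlarge the search space, the following is a minimal Python sketch of six-frame translation. It is not taken from the review itself; the function names (`reverse_complement`, `translate`, `six_frame_translate`) and the example sequence are hypothetical and chosen only for illustration. It assumes the standard genetic code and marks any codon containing a non-ACGT base as `X`.

```python
from itertools import product

# Standard genetic code, built with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translate(dna):
    """Return the six translations keyed by strand and frame (+1..+3, -1..-3)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    # Hypothetical toy sequence, used only to show the output shape.
    for name, protein in six_frame_translate("ATGGCCTAA").items():
        print(name, protein)
```

Every nucleotide sequence yields six candidate protein sequences, most of which are never expressed, which is one reason a genome-level search is both slower and more prone to false matches than a search against annotated proteins.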
3 -Title: Mass Spectrum Sequential Subtraction: bioinformatics method facilitates searching large dataset of peptide MS/MS spectra against large nucleotide databases for proteogenomics
Status: Under preparation
We developed MSSS (Mass Spectrum Sequential Subtraction), a novel bioinformatics method to compare the large datasets of peptide spectra produced by liquid chromatography-mass spectrometry.
International Conference Contributions
1- The 9th Annual World Congress of the Human Proteome Organization (HUPO2010). Sydney, Australia.
Date and Venue: September 2010, Sydney Convention and Exhibition Center, Sydney, Australia.
Title: Onco-proteogenomics: toward the identification of oncogenic peptides and proteins.
Abstract: The accumulation of somatic mutations is a common property of all cancer genomes. These mutations include several patterns of mutagenesis such as small insertions, chromosomal rearrangements and nucleotide substitutions. Consequently, the mutated genomes produce mutant proteins that give the cancer cell its oncogenic properties. For such mutated proteins, however, mass spectrometry-based identification by shotgun proteomics is generally difficult because the identification depends on databases containing normal proteins or hybrid databases with normal and mutated proteins. Here, we present 'onco-proteogenomics', a novel proteogenomics approach to identify cancer-related peptides and proteins using four databases containing normal sequences (human protein, cDNA, mRNA and genome databases) and one cancer-derived database (cancer EST database). We applied our approach to MS/MS data obtained from 15 nanoLC-MS/MS runs of the HeLa S3 cell phosphoproteome. The human protein, cDNA, mRNA and genome databases were used for Mascot peptide identification. Following each identification, we subtracted all MS/MS spectra corresponding to the identified peptides to exclude all spectra matched to peptides expressed from normal sequences. Next, we constructed a HeLa S3 EST database by combining HeLa S3 ESTs generated in seven different studies. The constructed database contains over 60,000 entries. For the remaining unidentified MS/MS spectra, we performed a Mascot search against this EST database. Consequently, we were able to identify 25 oncogenic peptides, including phosphorylated sites. In future work, we will apply the same approach to different cancers, aiming to identify global cancer biomarkers and drug targets.
2- Beyond the Genome: The true gene count, human evolution and disease genomics (BTG2010)
Date and Venue: October 2010, Harvard Medical School, Boston, MA, USA.
Title: Onco-proteogenomics: A novel approach for identification of cancer-specific mutation combining proteomics and transcriptome deep sequencing
Abstract: The accumulation of somatic mutations is a common property of all cancer genomes. These mutations include several patterns of mutagenesis such as small insertions, chromosomal rearrangements and nucleotide substitutions. Consequently, the mutated genomes produce a mutant transcriptome and, therefore, mutant proteins that give the cancer cell its oncogenic properties. For such mutated proteins, however, mass spectrometry-based identification by shotgun proteomics is generally difficult because the identification depends on databases containing normal proteins or hybrid databases with normal and mutated proteins. Here, we present 'onco-proteogenomics', a novel proteogenomics approach to identify cancer-related peptides (phospho- and non-phospho peptides) and proteins. We analyzed HeLa S3 cells as a test sample by shotgun phosphoproteomics, and the resulting data were analyzed by an extended version of MSRI (MS Spectra Reduction after Identification), the proteogenomic approach that we previously used to identify novel genomic features in rice. In onco-proteogenomics, we used four databases containing normal sequences (human protein, cDNA, mRNA and genome databases) for Mascot peptide identification and removed all MS/MS spectra corresponding to the identified peptides. The remaining MS/MS spectra were searched against one cancer-derived database, obtained by deep sequencing of HeLa S3 cells, to identify cancer-specific peptides. The four databases containing normal sequences were used sequentially in the peptide identification to identify all potential peptide sequences that can be generated from the normal genome. This includes the potential protein sequences, junction peptides and exon-skipping peptides (protein and cDNA databases), exonic peptides (mRNA database) and extragenic peptides (genome database).
Following each Mascot database search, we removed all MS/MS spectra corresponding to the identified peptide sequences and created new files containing the remaining MS/MS spectra. Next, we constructed a HeLa S3 transcriptome database with data obtained from deep sequencing of HeLa S3 cells (NCBI UniGene Database). The constructed database contains over 60,000 entries. For the remaining unidentified MS/MS spectra, we performed a Mascot search against this transcriptome database. Consequently, we were able to identify 25 cancer-specific peptides, including phosphorylated sites. As a further check, the identified peptides were aligned to the normal databases using NCBI BLAST. The alignment did not show any significant matches, indicating that these peptides are specifically expressed in the HeLa S3 cancer cell line. In future work, we will apply the same approach to different cancers, aiming to identify global cancer biomarkers and drug targets.
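The sequential search-and-subtract workflow described in this abstract can be sketched in Python as follows. This is only an illustrative outline under stated assumptions, not the actual MSRI implementation: the real searches were run with Mascot, which is represented here by a hypothetical `search` callable, and the spectrum IDs, database names and peptide labels in `toy_search` are invented placeholders rather than real results.

```python
def sequential_subtraction(spectra, databases, search):
    """Search spectra against each database in order, removing identified
    spectra before the next search.

    spectra   -- dict mapping spectrum ID to spectrum data
    databases -- ordered list of database names
    search    -- callable(spectra, db_name) -> dict of spectrum ID -> peptide
                 (stand-in for an external search engine such as Mascot)
    Returns (identified, remaining).
    """
    remaining = dict(spectra)
    identified = {}
    for db_name in databases:
        hits = search(remaining, db_name)
        for spectrum_id, peptide in hits.items():
            identified[spectrum_id] = (db_name, peptide)
            del remaining[spectrum_id]  # subtract the identified spectrum
    return identified, remaining


def toy_search(spectra, db_name):
    """Hypothetical lookup that mimics a search engine on toy data."""
    toy_hits = {
        "protein": {"s1": "PEPTIDE_A"},
        "cdna": {"s1": "PEPTIDE_A", "s2": "PEPTIDE_B"},
        "mrna": {"s3": "PEPTIDE_C"},
        "genome": {"s4": "PEPTIDE_D"},
        "cancer_transcriptome": {"s5": "PEPTIDE_E"},
    }
    hits = toy_hits.get(db_name, {})
    return {sid: pep for sid, pep in hits.items() if sid in spectra}


if __name__ == "__main__":
    spectra = {f"s{i}": None for i in range(1, 7)}  # placeholder spectra
    normal_dbs = ["protein", "cdna", "mrna", "genome"]

    # Step 1: remove every spectrum explained by a normal sequence.
    normal, leftover = sequential_subtraction(spectra, normal_dbs, toy_search)
    # Step 2: search only the unexplained spectra against the cancer database.
    cancer, unexplained = sequential_subtraction(
        leftover, ["cancer_transcriptome"], toy_search
    )
    print(normal)
    print(cancer)
    print(sorted(unexplained))
```

Because each spectrum is removed as soon as it is identified, a spectrum is attributed to the first database that explains it, and only spectra unexplained by all normal databases reach the cancer-specific search.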