Achievements report for the Graduate Student Researcher Development Grant by the Taikichiro Mori Memorial Research Fund for the academic year 2010
Student name: Mohamed Helmy Mahmoud Attia Shehata
Student number: 80725860
Student grade: Ph.D., 2nd grade
Affiliation: Graduate School of Media and Governance
Research project: Rice Proteogenomics project
 
Introduction
Using the Graduate Student Researcher Development Grant by the Taikichiro Mori Memorial Research Fund, I was able to
improve my research environment and therefore achieve better results in my
research. This resulted in three research articles: two have already been
submitted and one is in final preparation. In
addition, I contributed to two international conferences. The research grant
helped me upgrade my workspace with new hardware (e.g., a Mac mini and a monitor) and set up a mobile workstation by
purchasing an Apple iPad and some peripherals.
Moreover, I used the grant to cover part of my travel expenses
for the conferences and to pay the conference registration
fees.
Research Articles
1- Title: OryzaPG-DB: Rice Proteome Database based on Shotgun Proteogenomics
Journal: BMC Plant Biology
Status: Revised version submitted
Abstract:
Background: Proteogenomics aims to utilize experimental proteome 
information for refinement of genome annotation. Since mass spectrometry-based 
shotgun proteomics approaches provide large-scale peptide sequencing data with 
high throughput, a data repository for shotgun proteogenomics would represent a 
valuable source of gene expression evidence at the translational level for 
genome re-annotation.  
Description: Here, we present OryzaPG-DB, a rice proteome database based on shotgun 
proteogenomics, which incorporates the genomic features of experimental shotgun 
proteomics data. This version of the database was created from the results of 27 
nanoLC-MS/MS runs on a hybrid ion trap-orbitrap mass spectrometer, which offers high accuracy for 
analyzing tryptic digests from undifferentiated cultured rice cells. Peptides 
were identified by searching the product ion spectra against the protein, cDNA, 
transcript and genome databases from Michigan State University, and were mapped 
to the rice genome. Approximately 3200 genes were covered by these peptides and 
40 of them contained novel genomic features. Users can search, download or 
navigate the database per chromosome, gene, protein, cDNA or transcript and 
download the updated annotations in standard GFF3 format, with visualization in 
PNG format. In addition, the database scheme of OryzaPG was designed to be generic and can be reused to host 
similar proteogenomic information for other species. OryzaPG is the first proteogenomics-based database of the 
rice proteome, providing peptide-based expression profiles, together with the 
corresponding genomic origin, including the annotation of novelty for each 
peptide.
Conclusions: The OryzaPG database was
constructed and is freely available at http://oryzapg.iab.keio.ac.jp/.
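As an illustration of the GFF3 export mentioned in the abstract above, the following minimal sketch formats one peptide-to-genome mapping as a nine-column GFF3 feature line. The PeptideHit fields, the source label, the feature type "region" and the lowercase "novel" attribute are assumptions made for this example; they are not taken from the OryzaPG-DB schema.

```python
from dataclasses import dataclass


@dataclass
class PeptideHit:
    """One peptide mapped to one genomic interval (illustrative fields only)."""
    seqid: str    # chromosome or scaffold name
    start: int    # 1-based inclusive start, as GFF3 expects
    end: int      # 1-based inclusive end
    strand: str   # "+" or "-"
    peptide: str  # amino-acid sequence of the identified peptide
    novel: bool   # True if existing annotation does not explain the peptide


def to_gff3_line(hit: PeptideHit, source: str = "example_pipeline") -> str:
    """Format a hit as a tab-separated GFF3 line with nine columns."""
    attributes = ";".join([
        f"ID={hit.seqid}_{hit.start}_{hit.end}",
        f"Name={hit.peptide}",
        # Lowercase tag: GFF3 reserves capitalized attribute names.
        f"novel={'yes' if hit.novel else 'no'}",
    ])
    columns = [
        hit.seqid,       # 1. seqid
        source,          # 2. source (assumed label)
        "region",        # 3. type (assumed feature type)
        str(hit.start),  # 4. start
        str(hit.end),    # 5. end
        ".",             # 6. score (not set here)
        hit.strand,      # 7. strand
        ".",             # 8. phase (only meaningful for CDS features)
        attributes,      # 9. attributes
    ]
    return "\t".join(columns)


example_hit = PeptideHit("Chr1", 1000, 1029, "+", "SAMPLEPEPK", True)
example_line = to_gff3_line(example_hit)
print("##gff-version 3")
print(example_line)
```

A real export would likely group the peptides of one gene under a parent feature; the flat layout here is only meant to show the column order.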
2- Title: Peptide identification by searching large-scale tandem mass spectra against large databases: bioinformatics methods in proteogenomics
Journal: Journal of genes, genome and genomics
Status: Submitted
Abstract:
Mass spectrometry-based shotgun proteomics approaches are currently considered 
as the technology-of-choice for large-scale proteogenomics due to high 
throughput, good availability and relative ease of use. Protein mixtures are
first digested with a protease, e.g. trypsin, and the resultant peptides are
analyzed using liquid chromatography-tandem mass spectrometry. Proteins and
peptides are identified from the resultant tandem mass spectra by de novo 
interpretation of the spectra or by searching databases of putative sequences. 
Since this data represents the expressed proteins in the sample, it can be used 
to infer novel proteogenomic features when mapped to the genome. However, 
high-throughput mass spectrometry instruments can readily generate hundreds of 
thousands, perhaps millions, of spectra and the size of genomic databases, such 
as six-frame translated genome databases, is enormous. Therefore, computational 
demands are very high, and there is potential inaccuracy in peptide 
identification due to the large search space. These issues are considered the 
main challenges that limit the utilization of this approach. In this review, we 
highlight the efforts of the proteomics and bioinformatics communities to 
develop methods, algorithms and software tools that facilitate peptide sequence 
identification from databases in large-scale proteogenomic studies.      
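To make the search-space issue raised in the abstract above concrete, here is a minimal sketch of six-frame translation, assuming the standard genetic code and a DNA string containing only A, C, G and T. The function names and the toy sequence are hypothetical and are not taken from the review itself.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translate(dna: str) -> dict:
    """Translate both strands in all three reading frames."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


toy_frames = six_frame_translate("ATGGCC")
for label, protein in sorted(toy_frames.items()):
    print(label, protein)
```

Every nucleotide position contributes to six candidate reading frames, so a translated genome database carries roughly two residues per input base. This illustrates why both the computational cost and the risk of chance matches grow when spectra are searched against whole genomes.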
3- Title: Mass Spectrum Sequential Subtraction: bioinformatics method facilitates searching large dataset of peptide MS/MS spectra against large nucleotide databases for proteogenomics
Status: Under preparation
Abstract:
We developed MSSS (Mass Spectrum Sequential Subtraction), a novel bioinformatics method to compare the large datasets 
of peptide spectra produced by liquid chromatography-mass spectrometry 
(LC-MS/MS) a
International Conference Contributions
1- The 9th Annual World Congress of the Human Proteome Organization (HUPO2010), Sydney, Australia.
Date and Venue: September 2010, Sydney Convention and Exhibition Center, Sydney, Australia.
Title: Onco-proteogenomics: toward the identification of oncogenic peptides and proteins.
Abstract:
Accumulation of somatic mutations is a common property of all cancer genomes. These mutations
include several patterns of mutagenesis such as small insertions, chromosomal 
rearrangement and nucleotide substitutions. Consequently, the mutated genomes 
produce mutant proteins that give the cancer cell its oncogenic properties. For such mutated proteins, however, 
mass spectrometry-based identification by shotgun proteomics is generally 
difficult because the identification depends on databases containing normal
proteins or on a hybrid database of normal and mutated proteins. Here, we present
'onco-proteogenomics', a novel proteogenomics approach
to identify cancer-related peptides and proteins using four databases
containing normal sequences (human protein, cDNA, mRNA and genome databases) and
one cancer-derived database (a cancer EST database). We applied our approach to
MS/MS data obtained from 15 nanoLC-MS/MS runs of the HeLa S3 cell phosphoproteome.
The human protein, cDNA, mRNA and genome databases were used for Mascot peptide
identification. Following each identification, we
subtracted all MS/MS spectra corresponding to the identified peptides to exclude
all spectra matched to peptides expressed from normal sequences. Next, we
constructed a HeLa S3 EST database by combining HeLa S3 ESTs generated in seven different studies. The
constructed database contains over 60,000 entries. For the remaining
unidentified MS/MS spectra, we performed a Mascot search against this EST
database. Consequently, we were able to identify 25 oncogenic peptides, including phosphorylated sites. In future work, we will apply the
same approach to different cancers, aiming to identify global cancer biomarkers
and drug targets.
2- Beyond the Genome: The true gene count, human evolution and disease genomics (BTG2010)
Date and Venue: October 2010, Harvard Medical School, Boston, MA, USA.
Title: Onco-proteogenomics: A novel approach for identification of cancer-specific mutation combining proteomics and transcriptome deep sequencing
Abstract:
The accumulation of somatic mutations is a common property of all cancer genomes.
These mutations include several patterns of mutagenesis such as small 
insertions, chromosomal rearrangement and nucleotide substitutions. 
Consequently, the mutated genomes produce a mutant transcriptome and, therefore, mutant proteins that give the
cancer cell its oncogenic properties [1]. For such
mutated proteins, however, mass spectrometry-based identification by shotgun
proteomics is generally difficult because the identification depends on
databases containing normal proteins or on a hybrid database of normal and mutated
proteins. Here, we present 'onco-proteogenomics', a
novel proteogenomics approach to identify cancer-related peptides (phospho- and non-phosphopeptides)
and proteins. We analyzed HeLa S3 cells as a test sample by shotgun phosphoproteomics, and the obtained data were analyzed by an extended version of MSRI (MS Spectra Reduction after
Identification), the proteogenomic approach that we previously used for the
identification of novel genomic features in rice [2]. In onco-proteogenomics, we used four databases containing
normal sequences (human protein, cDNA, mRNA and genome databases) for Mascot
peptide identification and removed all MS/MS spectra corresponding to the
identified peptides. The remaining MS/MS spectra were searched against one
cancer-derived database, obtained by deep sequencing of HeLa S3 cells, to identify cancer-specific peptides. The four
databases containing normal sequences were used sequentially in the peptide
identification to identify all potential peptide sequences that can be generated
from the normal genome. These include potential protein sequences,
junction peptides and exon-skipping peptides (protein and cDNA databases), exonic peptides (mRNA database) and extragenic peptides (genome database). Following each
Mascot database search, we removed all MS/MS spectra corresponding to the
identified peptide sequences and created new files containing the remaining MS/MS
spectra. Next, we constructed a HeLa S3 transcriptome database from data obtained by deep
sequencing of HeLa S3 cells (NCBI UniGene Database). The constructed database contains over
60,000 entries. For the remaining unidentified MS/MS spectra, we performed a
Mascot search against this transcriptome database.
Consequently, we were able to identify 25 cancer-specific peptides, including
phosphorylated sites. As a further check, the
identified peptides were aligned to the normal databases using NCBI
BLAST. The alignment did not show any significant matches, indicating that these
peptides are specifically expressed in the HeLa S3
cancer cell line. In future work, we will apply the same approach to different
cancers, aiming to identify global cancer biomarkers and drug
targets.
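
The sequential search-and-subtract workflow described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual MSRI implementation: the search step is passed in as a function, the toy_search stand-in replaces a real Mascot search, and all spectrum identifiers, database names and peptide strings are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A search function takes spectrum IDs and a database name and returns
# a mapping of spectrum ID -> identified peptide (hypothetical interface).
SearchFn = Callable[[List[str], str], Dict[str, str]]


def sequential_subtraction(
    spectrum_ids: List[str],
    databases: List[str],
    search: SearchFn,
) -> Tuple[Dict[str, Tuple[str, str]], List[str]]:
    """Search databases in order, removing identified spectra each time."""
    remaining = list(spectrum_ids)
    identified: Dict[str, Tuple[str, str]] = {}
    for database in databases:
        hits = search(remaining, database)
        for spectrum_id, peptide in hits.items():
            identified[spectrum_id] = (database, peptide)
        # Only unidentified spectra go on to the next database.
        remaining = [s for s in remaining if s not in hits]
    return identified, remaining


# Toy lookup table standing in for real search results (made-up data).
TOY_MATCHES = {
    "protein": {"s1": "PEPTIDEK"},
    "cdna": {"s1": "PEPTIDEK", "s2": "JUNCTIONR"},
    "mrna": {"s3": "EXONICK"},
    "genome": {"s4": "EXTRAGENICR"},
    "hela_transcriptome": {"s5": "MUTANTK"},
}


def toy_search(spectrum_ids: List[str], database: str) -> Dict[str, str]:
    """Return toy matches for the spectra that are still unidentified."""
    table = TOY_MATCHES.get(database, {})
    return {s: table[s] for s in spectrum_ids if s in table}


ORDER = ["protein", "cdna", "mrna", "genome", "hela_transcriptome"]
identified, unassigned = sequential_subtraction(
    ["s1", "s2", "s3", "s4", "s5", "s6"], ORDER, toy_search
)
cancer_candidates = {
    s: pep for s, (db, pep) in identified.items() if db == ORDER[-1]
}
print(cancer_candidates)
print(unassigned)
```

Because each spectrum is removed as soon as a normal-sequence database explains it, only spectra that survive all four normal searches reach the cancer-derived database, which keeps the final search small and its hits easier to interpret.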