Achievements report for the Graduate Student Researcher Development Grant by Taikichiro Mori Memorial Research Fund for the academic year 2010
Student name: Mohamed Helmy Mahmoud Attia Shehata
Student number: 80725860
Student grade: Ph.D., 2nd grade.
Affiliation: Graduate School of Media and Governance
Research project: Rice Proteogenomics project
Introduction
Using the Graduate Student Researcher Development Grant by Taikichiro Mori
Memorial Research Fund, I was able to improve my research environment and
therefore achieve better results in my research. This resulted in three
research articles: two have already been submitted and one is in final
preparation. In addition, I contributed to two international conferences. The
research grant helped me to upgrade my workspace by adding new hardware, e.g. a
Mac mini and a monitor, and to set up a new mobile workstation by purchasing an
Apple iPad and some peripherals. Moreover, I used the grant to cover part of my
travel expenses for the conferences and to pay the conference registration
fees.
Research Articles
1- Title: OryzaPG-DB: Rice Proteome Database based on Shotgun Proteogenomics
Journal: BMC Plant Biology
Status: Revised version submitted
Abstract:
Background: Proteogenomics aims to utilize experimental proteome
information for refinement of genome annotation. Since mass spectrometry-based
shotgun proteomics approaches provide large-scale peptide sequencing data with
high throughput, a data repository for shotgun proteogenomics would represent a
valuable source of gene expression evidence at the translational level for
genome re-annotation.
Description: Here, we present OryzaPG-DB, a rice proteome database based on shotgun
proteogenomics, which incorporates the genomic features of experimental shotgun
proteomics data. This version of the database was created from the results of 27
nanoLC-MS/MS runs on a hybrid ion trap-orbitrap mass spectrometer, which offers high accuracy for
analyzing tryptic digests from undifferentiated cultured rice cells. Peptides
were identified by searching the product ion spectra against the protein, cDNA,
transcript and genome databases from Michigan State University, and were mapped
to the rice genome. Approximately 3200 genes were covered by these peptides and
40 of them contained novel genomic features. Users can search, download or
navigate the database per chromosome, gene, protein, cDNA or transcript and
download the updated annotations in standard GFF3 format, with visualization in
PNG format. In addition, the database scheme of OryzaPG was designed to be generic and can be reused to host
similar proteogenomic information for other species. OryzaPG is the first proteogenomics-based database of the
rice proteome, providing peptide-based expression profiles, together with the
corresponding genomic origin, including the annotation of novelty for each
peptide.
Conclusions: The OryzaPG database was constructed and is freely available at
http://oryzapg.iab.keio.ac.jp/.
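As a rough illustration of working with annotations downloaded in GFF3 format, the following sketch parses a single GFF3 feature line into its nine standard tab-separated columns. The example line, the feature ID and the attribute values are invented for illustration and are not taken from OryzaPG-DB.

```python
from urllib.parse import unquote

# The nine tab-separated columns defined by the GFF3 format.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    # Coordinates are 1-based and inclusive in GFF3.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds semicolon-separated key=value pairs, URL-escaped.
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[unquote(key)] = unquote(value)
    record["attributes"] = attributes
    return record


# Hypothetical feature line (not real OryzaPG-DB content).
example = ("Chr1\texample\tCDS\t1000\t1029\t.\t+\t0\t"
           "ID=pep0001;Note=hypothetical%20novel%20peptide")
feature = parse_gff3_line(example)
feature_length = feature["end"] - feature["start"] + 1
print(feature["seqid"], feature_length, feature["attributes"]["Note"])
```

Keeping the attributes as a plain dictionary makes it straightforward to filter features on an annotation such as a novelty flag, if the downloaded file provides one.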
2- Title: Peptide identification by searching large-scale tandem mass spectra
against large databases: bioinformatics methods in proteogenomics
Journal: Journal of genes, genome and genomics
Status: Submitted
Abstract:
Mass spectrometry-based shotgun proteomics approaches are currently considered
the technology of choice for large-scale proteogenomics due to high
throughput, good availability and relative ease of use. Protein mixtures are
first digested with a protease, e.g. trypsin, and the resultant peptides are
analyzed using liquid chromatography-tandem mass spectrometry. Proteins and
peptides are identified from the resultant tandem mass spectra by de novo
interpretation of the spectra or by searching databases of putative sequences.
Since this data represents the expressed proteins in the sample, it can be used
to infer novel proteogenomic features when mapped to the genome. However,
high-throughput mass spectrometry instruments can readily generate hundreds of
thousands, perhaps millions, of spectra and the size of genomic databases, such
as six-frame translated genome databases, is enormous. Therefore, computational
demands are very high, and there is potential inaccuracy in peptide
identification due to the large search space. These issues are considered the
main challenges that limit the utilization of this approach. In this review, we
highlight the efforts of the proteomics and bioinformatics communities to
develop methods, algorithms and software tools that facilitate peptide sequence
identification from databases in large-scale proteogenomic studies.
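To make the size of the search space more concrete, the following minimal sketch shows how a six-frame translated database entry can be derived from a nucleotide sequence: three reading frames on the forward strand and three on the reverse complement. It assumes the standard genetic code and an unambiguous A/C/G/T alphabet; the function names and the short example sequence are illustrative only.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T sequence."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Translate a sequence in all six reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Every nucleotide sequence yields six candidate protein sequences in this scheme, which is one reason why searching spectra against a whole translated genome is far more demanding than searching an annotated protein database.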
3- Title: Mass Spectrum Sequential Subtraction: a bioinformatics method that
facilitates searching large datasets of peptide MS/MS spectra against large
nucleotide databases for proteogenomics
Status: Under preparation
Abstract:
We developed MSSS (Mass Spectrum Sequential Subtraction), a novel bioinformatics method to compare the large datasets
of peptide spectra produced by liquid chromatography-mass spectrometry
(LC-MS/MS) a
International Conference Contributions
1- The 9th Annual World Congress of the Human Proteome Organization (HUPO2010).
Sydney, Australia.
Date and Venue: September 2010, Sydney Convention and Exhibition Center, Sydney,
Australia.
Title: Onco-proteogenomics: toward the identification of oncogenic peptides and
proteins.
Abstract:
The accumulation of somatic mutations is a common property of all cancer
genomes. These mutations include several patterns of mutagenesis such as small
insertions, chromosomal rearrangements and nucleotide substitutions.
Consequently, the mutated genomes produce mutant proteins that give the cancer
cell its oncogenic properties. For such mutated proteins, however, mass
spectrometry-based identification by shotgun proteomics is generally difficult
because the identification depends on databases containing normal proteins or
hybrid databases with normal and mutated proteins. Here, we present
'onco-proteogenomics', a novel proteogenomics approach to identify
cancer-related peptides and proteins using four databases containing normal
sequences (human protein, cDNA, mRNA and genome databases) and one
cancer-derived database (a cancer EST database). We applied our approach to
MS/MS data obtained from 15 nanoLC-MS/MS runs of the HeLa S3 cell
phosphoproteome. The human protein, cDNA, mRNA and genome databases were used
for Mascot peptide identification. Following each identification, we subtracted
all MS/MS spectra corresponding to the identified peptides to exclude all
spectra matched to peptides expressed from normal sequences. Next, we
constructed a HeLa S3 EST database by combining HeLa S3 ESTs generated in seven
different studies. The constructed database contains over 60,000 entries. We
then searched the remaining unidentified MS/MS spectra against this EST
database using Mascot. Consequently, we were able to identify 25 oncogenic
peptides, including phosphorylated sites. In future work, we will apply the
same approach to different cancers, aiming to identify global cancer biomarkers
and drug targets.
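The search-then-subtract procedure described in this abstract can be sketched as a simple loop. This is only an illustration under stated assumptions: `sequential_subtraction`, `toy_search`, the spectrum identifiers and the peptide strings are all hypothetical, and the `search` callable is a stand-in for a real search engine run such as Mascot, not its actual interface.

```python
def sequential_subtraction(spectra, databases, search):
    """Search spectra against databases in order, removing hits each time.

    spectra:   dict mapping spectrum id -> spectrum data
    databases: ordered list of database names
    search:    callable(spectra, database) -> dict of spectrum id -> peptide
               (hypothetical stand-in for a real search engine)
    """
    remaining = dict(spectra)
    identified = {}
    for database in databases:
        hits = search(remaining, database)
        for spectrum_id, peptide in hits.items():
            identified[spectrum_id] = (database, peptide)
            # Subtract the identified spectrum before the next search.
            remaining.pop(spectrum_id, None)
    return identified, remaining


# Toy lookup table standing in for real search results (invented data).
TOY_HITS = {
    "protein": {"s1": "PEPTIDEK"},
    "cDNA": {"s1": "PEPTIDEK", "s2": "SAMPLER"},
    "EST": {"s3": "MUTANTK", "s4": "OTHERK"},
}


def toy_search(spectra, database):
    """Return toy hits, restricted to spectra that are still unidentified."""
    return {sid: pep for sid, pep in TOY_HITS.get(database, {}).items()
            if sid in spectra}


spectra = {sid: None for sid in ("s1", "s2", "s3", "s5")}
order = ["protein", "cDNA", "mRNA", "genome", "EST"]
identified, remaining = sequential_subtraction(spectra, order, toy_search)
print(identified)
print(sorted(remaining))
```

Because each spectrum is removed as soon as it is explained by a normal-sequence database, only spectra that none of those databases could explain reach the final cancer-derived database search.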
2- Beyond the Genome: The true gene count, human evolution and disease genomics
(BTG2010)
Date and Venue: October 2010, Harvard Medical School, Boston, MA, USA.
Title: Onco-proteogenomics: A novel approach for identification of
cancer-specific mutation combining proteomics and transcriptome deep sequencing
Abstract:
The accumulation of somatic mutations is a common property of all cancer
genomes. These mutations include several patterns of mutagenesis such as small
insertions, chromosomal rearrangements and nucleotide substitutions.
Consequently, the mutated genomes produce a mutant transcriptome and,
therefore, mutant proteins that give the cancer cell its oncogenic properties
[1]. For such mutated proteins, however, mass spectrometry-based identification
by shotgun proteomics is generally difficult because the identification depends
on databases containing normal proteins or hybrid databases with normal and
mutated proteins. Here, we present 'onco-proteogenomics', a novel
proteogenomics approach to identify cancer-related peptides (phospho- and
non-phospho peptides) and proteins. We analyzed HeLa S3 cells as a test sample
by shotgun phosphoproteomics, and the obtained data was analyzed by an extended
version of MSRI (MS Spectra Reduction after Identification), the proteogenomic
approach that we used previously to identify novel genomic features in rice
[2]. In onco-proteogenomics, we used four databases containing normal sequences
(human protein, cDNA, mRNA and genome databases) for Mascot peptide
identification and removed all the MS/MS spectra corresponding to the
identified peptides. The remaining MS/MS spectra were searched against one
cancer-derived database obtained by deep sequencing of HeLa S3 cells to
identify cancer-specific peptides. The four databases containing normal
sequences were used sequentially in the peptide identification to identify all
potential peptide sequences that can be generated from the normal genome. This
includes the potential protein sequences, junction peptides and exon-skipping
peptides (protein and cDNA databases), exonic peptides (mRNA database) and
extragenic peptides (genome database). Following each Mascot database search,
we removed all the MS/MS spectra corresponding to the identified peptide
sequences and created new files containing the remaining MS/MS spectra. Next,
we constructed a HeLa S3 transcriptome database with data obtained from deep
sequencing of HeLa S3 cells (NCBI UniGene Database). The constructed database
contains over 60,000 entries. We then searched the remaining unidentified MS/MS
spectra against this transcriptome database using Mascot. Consequently, we were
able to identify 25 cancer-specific peptides, including phosphorylated sites.
As a further check, the identified peptides were aligned to the employed normal
databases using NCBI BLAST. The alignment did not show any significant matches,
indicating that these peptides are specifically expressed in the HeLa S3 cancer
cell line. In future work, we will apply the same approach to different
cancers, aiming to identify global cancer biomarkers and drug targets.
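
The step of creating new files that contain only the remaining MS/MS spectra can be illustrated with a small filter over a peak-list file. The sketch below assumes the spectra are stored in Mascot Generic Format (MGF), which the abstract does not state, and that identified spectra are recognized by their TITLE line; the function name and the example spectra are invented for illustration.

```python
def filter_mgf(mgf_text, identified_titles):
    """Return MGF text without spectra whose TITLE is in identified_titles.

    Only BEGIN IONS ... END IONS blocks are kept; any lines outside
    those blocks are dropped in this simplified sketch.
    """
    kept = []
    block = []
    for line in mgf_text.splitlines():
        stripped = line.strip()
        if stripped == "BEGIN IONS":
            block = [line]
        elif stripped == "END IONS":
            block.append(line)
            # Find the TITLE of the spectrum that just ended, if any.
            title = next((entry.strip()[len("TITLE="):] for entry in block
                          if entry.strip().startswith("TITLE=")), None)
            if title not in identified_titles:
                kept.extend(block)
            block = []
        elif block:
            block.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


# Two invented spectra; scan_1 is treated as already identified.
example = """BEGIN IONS
TITLE=scan_1
PEPMASS=500.25
CHARGE=2+
100.1 10
200.2 20
END IONS
BEGIN IONS
TITLE=scan_2
PEPMASS=612.30
CHARGE=2+
150.1 5
END IONS
"""
remaining_mgf = filter_mgf(example, {"scan_1"})
print(remaining_mgf)
```

The returned text can be written to a new file and used as the input for the search against the next database in the sequence.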