Molecular Interaction Knowledgebase for the E-Cell 3 project

Project name: E-Cell 3 project
Affiliation: Project member

Name: Bereczki Gabor, M2

Motivation

Many modeling problems in cell simulation require access to information provided by various public biological databases. Providing unified access to these databases as well as integrating them with model building software presents a standing challenge for designers of modeling systems.

Objectives

The ultimate objective of a modeling environment is to facilitate the research cycle by eliminating the bottleneck of drawing up biochemical networks.

  • Interface to various public databases of biochemical reaction related information
  • Simple and concise user interface for drawing up small-scale models from databases
  • Methods to recombine small-scale models, then test, debug, and estimate parameters for them

Databases

Of great importance to the modeling community are databases that contain molecular interactions and background information about important molecular biology entities such as genes and proteins. The databases of primary interest are therefore those that hold:

  • metabolomic interactions
  • protein-protein interactions
  • transcriptional regulatory interactions
  • gene expression information
  • gene and nucleotide sequence catalogs
  • protein catalogs
  • biochemical molecule catalogs

The following databases were thus integrated:

KEGG is a comprehensive database of metabolic pathways, reactions, compounds, and participating genes, all of which are parsed as entities.

The Biomolecular Interaction Network Database (BIND) is a collection of records documenting molecular interactions. BIND contains protein-protein interactions, protein-DNA interactions, and protein complex information. BIND recently closed down its curation operations because of funding problems.

NCBI GenBank is an annotated collection of all publicly available DNA sequences. GenBank is synchronized with EBI EMBL on a daily basis, so it covers all DNA sequences publicly deposited in either archive. In the data warehouse, GenBank is parsed as gene, DNA sequence, and RNA sequence entities.

The NCBI Reference Sequence Project (RefSeq) is an effort to provide the best single collection of naturally occurring biomolecules, representative of the central dogma, for each major organism. The database is a collection of DNA, RNA and Protein sequences and genes.

NCBI Gene has been implemented to organize information about genes, serving as a hub between databases internal and external to NCBI in the nexus of genomic map, sequence, expression, protein structure, function, and homology data. The database structure is clearly built around the central dogma, cross-referencing genes, DNA, RNA, and protein sequences.

The NCBI Taxonomy contains the names of and assigns IDs to all organisms that are represented in genetic databases with at least one nucleotide or protein sequence. The database is parsed as one entity: taxonomy entry.

The Enzyme nomenclature database is a repository of information related to the nomenclature of enzymes. The database contains only enzyme entities.

The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases. The database contains ontology entries of biological process, cellular component and function.

The UniProtKB/Swiss-Prot Protein Knowledgebase is a curated protein sequence database that provides a high level of annotation (such as the description of protein function, domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases. It contains only protein entities.

Methods

Design principles

  • dynamic data model
  • retrieve high-quality data only, but do not restrict the scope
  • no inference from raw data
  • maximum recovery of cross references and interactions
  • data provenance

Storing the data

The data model defines the representation of knowledge in our data warehouse. The representation of knowledge is carried out in the form of an attributed undirected graph, where nodes are entities and edges are relationships.

Logical data model: The four major modeling objects are entities, attributes, relationships, and rules. An entity is a gene, protein, reaction, or any other kind of biomolecular phenomenon that databases or models can contain. Entities can belong to different classes. Entities belonging to the same class can be merged without loss of information if their globally unique identifiers match.
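The attributed graph described above can be pictured with a minimal Python sketch. The class names, field names, and identifier values below are hypothetical illustrations and not the warehouse's actual schema; rules are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A node in the graph: a gene, protein, reaction, and so on."""
    entity_id: int
    entity_class: str  # e.g. "gene" or "protein"
    # attribute code -> set of values (an attribute may be multivalued)
    attributes: dict = field(default_factory=dict)

    def add_attribute(self, code, value):
        self.attributes.setdefault(code, set()).add(value)


@dataclass(frozen=True)
class Relationship:
    """An edge between two entities, labeled with a relationship type."""
    rel_type: str
    entity_ids: frozenset  # unordered pair, because the graph is undirected


def relate(rel_type, a, b):
    return Relationship(rel_type, frozenset((a.entity_id, b.entity_id)))


# Hypothetical example data.
gene = Entity(1, "gene")
gene.add_attribute("gene_id", "GENE-0001")
protein = Entity(2, "protein")
protein.add_attribute("protein_id", "PROT-0001")
edge = relate("part of", gene, protein)
```

Storing the endpoints as a frozenset makes the edge independent of argument order, which matches the undirected graph of the data model.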

The physical data model:

Getting and parsing the data

The files are downloaded from the database servers using the FTP protocol. The file formats are usually flat file or XML. The files are parsed into preprocessing tables. Every preprocessing table represents a different type of entity. During parsing, multivalue attributes are normalized and grouped into different rows.
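As an illustration of normalizing multivalue attributes, the following sketch splits each multivalue field into one row per value. The record layout, the field names, and the ";" separator are assumptions made for the example; the actual parsers are not described here.

```python
def normalize_multivalue(record, separator=";"):
    """Return one (entity key, field name, value) row per individual value."""
    rows = []
    key = record["key"]
    for field_name, raw in record["fields"].items():
        for value in raw.split(separator):
            value = value.strip()
            if value:  # skip empty fragments such as a trailing separator
                rows.append((key, field_name, value))
    return rows


# Hypothetical parsed record with one multivalue field.
record = {"key": "entry-1", "fields": {"name": "alpha; beta", "ec": "1.1.1.1"}}
rows = normalize_multivalue(record)
```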

During load the preprocessed data is transferred from the “temp” database to the “data” database. Attributes are semantically labeled, that is, every attribute is assigned an attribute code by using a mapping table. Globally unique identifiers of entities and attribute rows are also assigned in the load phase.

Semantic labeling of properties is driven by rules that specify which label should be given to a specific attribute in a specific database. The most important principle of semantic labeling is to find common identifiers and names in different databases and label them with the same attribute ID for the purpose of cross-referencing. The most frequent common identifiers are: EC number, GI ID, Gene ID, Uniprot ID, and various KEGG IDs. There are 933 labeling rules for 33 preprocessing tables.
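A labeling rule can be thought of as a lookup from a source-specific column to a shared attribute code. The sketch below assumes a hypothetical rule layout and hypothetical table, column, and code names; it only shows how two sources end up sharing one label.

```python
# (source database, preprocessing table, raw column) -> shared attribute code.
# All names here are made up for illustration.
LABELING_RULES = {
    ("kegg", "kegg_enzyme", "ec"): "EC_NUMBER",
    ("enzyme", "enzyme_entry", "id"): "EC_NUMBER",
    ("uniprot", "uniprot_protein", "ac"): "UNIPROT_ID",
}


def label(source, table, column, value):
    """Return (attribute code, value), or None if no rule covers the column."""
    code = LABELING_RULES.get((source, table, column))
    if code is None:
        return None
    return (code, value)
```

Because both the KEGG and the Enzyme columns map to the same code, their values can later be compared directly during cross-referencing.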

Integrating the data

Integration of databases effectively means merging entities of the same type from different (or the same) database sources. Because attributes are made uniform during the semantic labeling procedure, merging happens without regard to the original source of the data.

Integration happens in 2 steps:

Merging entities. Merging is performed by rules stored in the merge_rules table. If the values of a given identifier type match across different instances of a given entity type, those instances are merged. Merging means overwriting the entity ID on the attributes of the other entities being merged with the surviving entity ID. The surviving entity ID is chosen randomly from the matching ones.
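The merge step can be sketched as follows, assuming attribute rows are dictionaries with the hypothetical keys entity_id, entity_class, code, and value. The union-find grouping is an addition of this sketch so that chains of matches fall into one group; the text does not say how the warehouse handles such chains. The survivor is picked with random.choice by default, as described above, and the choose parameter exists only to make the example reproducible.

```python
import random
from collections import defaultdict


def merge_entities(rows, entity_class, code, choose=random.choice):
    """Apply one merge rule and return rows with entity IDs overwritten."""
    # Group entity IDs by the value of the identifying attribute.
    by_value = defaultdict(set)
    for row in rows:
        if row["entity_class"] == entity_class and row["code"] == code:
            by_value[row["value"]].add(row["entity_id"])

    # Union-find, so that chains of matches (A~B, B~C) form a single group.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for ids in by_value.values():
        first, *rest = sorted(ids)
        for other in rest:
            parent[find(other)] = find(first)

    groups = defaultdict(list)
    for entity_id in list(parent):
        groups[find(entity_id)].append(entity_id)

    # Pick one survivor per group and map every member to it.
    survivor = {}
    for members in groups.values():
        chosen = choose(sorted(members))
        for member in members:
            survivor[member] = chosen

    return [dict(row, entity_id=survivor.get(row["entity_id"], row["entity_id"]))
            for row in rows]
```

For example, calling merge_entities(rows, "gene", "GENE_ID") would fold all gene entities that share a Gene ID value into one entity while keeping every attribute row.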

Establishing relationships. Relationships (such as “type of”, “part of”, “participates”, “specialization of”, etc.) are determined by another rule table, relationship_rules. A relationship between two entities can be set up if they belong to the specified types and the value of attribute1 of entity1 matches the value of attribute2 of entity2. This problem is thus very similar to that of merging by attributes and is implemented in a similar fashion. So far, 45 relationship rules have been introduced into the data warehouse.
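A single relationship rule can be sketched as an attribute join. The rule layout (rel_type, class1, attr1, class2, attr2) and the row keys are hypothetical stand-ins for the columns of relationship_rules, which are not listed here.

```python
from collections import defaultdict


def apply_relationship_rule(rows, rule):
    """Return sorted (entity1_id, rel_type, entity2_id) triples for one rule."""
    # Index entities of class2 by the value of attr2.
    index = defaultdict(set)
    for row in rows:
        if row["entity_class"] == rule["class2"] and row["code"] == rule["attr2"]:
            index[row["value"]].add(row["entity_id"])

    # Look up each attr1 value of class1 entities in that index.
    triples = set()
    for row in rows:
        if row["entity_class"] == rule["class1"] and row["code"] == rule["attr1"]:
            for other_id in index.get(row["value"], ()):
                if other_id != row["entity_id"]:
                    triples.add((row["entity_id"], rule["rel_type"], other_id))
    return sorted(triples)
```

With a rule such as "an enzyme participates in a reaction when their EC numbers match", every enzyme row is joined against the reaction rows carrying the same value.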

Distributing the data

Integrated data should be made available to users in many different forms. Exposing the underlying database tables to users directly is not advisable, so a communication layer was established. HTTP was chosen as the protocol for communication between the data warehouse and the client side because it is universally supported and poses negligible security challenges.

XML was chosen as the medium of communication because it is widely used for data exchange and because SOAP RPC is built upon XML technology, as is XHTML.

Client side tools

The client-side tools comprise a web service designed with a minimalist approach to provide core access to the database contents.

An MVC modeling tool is being developed that communicates with the data warehouse server and the web interface.

Implementation

Schematic diagram of implementation and used technologies

Results

Integrated data warehouse: up and running
Web services: up and running
SOAP services: up and running
Java editor tool: under development
Model Editor database support: only server side implemented

Statistics

Data content and storage: the database contains 13,383,335 unique entities, 198,923,486 attributes, and 8,770,663 relationships between entities. The database occupies around 12.3 GB of disk space without indexes, which consume another 14.5 GB (about 26.8 GB in total).

Characterization of data

User Interfaces

Web services

The web services facilitate easy searching and browsing of the database. The user can perform a keyword search through a Google-like interface. The database engine uses full-text indexes to perform a very fast search on all attribute values (texts, names, identifiers, metadata) and returns the results in groups of ten entities.
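The search behavior can be approximated with a small in-memory sketch. This is not the database engine's full-text index; it is a plain inverted index over attribute values, and the function names and the (entity_id, value) row layout are assumptions. Only the page size of ten comes from the description above.

```python
import re
from collections import defaultdict

PAGE_SIZE = 10  # results are returned in groups of ten entities


def build_index(attribute_rows):
    """Map each lowercase token to the set of entity IDs whose values contain it."""
    index = defaultdict(set)
    for entity_id, value in attribute_rows:
        for token in re.findall(r"\w+", value.lower()):
            index[token].add(entity_id)
    return index


def search(index, query, page=0):
    """Return one page of entity IDs matching every keyword in the query."""
    tokens = re.findall(r"\w+", query.lower())
    if not tokens:
        return []
    hits = set.intersection(*(index.get(token, set()) for token in tokens))
    ordered = sorted(hits)
    return ordered[page * PAGE_SIZE:(page + 1) * PAGE_SIZE]
```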

Clicking on one of the hits presents the detailed results, which contain:

  • the type of the entity
  • the relationships the entity participates in
  • the detailed attribute list
  • links to outside references (if any)
  • data sources

Java Editor tool

To make browsing through the biomolecular network more user friendly, a Java editing tool is being developed. The Java editing tool is a simplified SBML editor which performs the following tasks:

  • keeps track of the entity information pages the user has visited
  • builds a graphical representation of the network walked through, upon request
  • can extend this graph by automatic actions
  • can save the graph, which is in fact a model skeleton, in SBML format
  • annotates the SBML model skeleton with database identifier information
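To give an idea of what an annotated model skeleton might look like, the sketch below writes visited entities as SBML Level 2 species. It is written in Python for brevity and is not the Java tool's code. The annotation namespace, the single "cell" compartment, and the identifier element are hypothetical simplifications; SBML annotations in practice usually carry RDF.

```python
import xml.etree.ElementTree as ET

SBML_NS = "http://www.sbml.org/sbml/level2"  # SBML Level 2 Version 1 namespace
KB_NS = "urn:example:knowledgebase"          # hypothetical annotation namespace


def build_skeleton(model_id, visited):
    """visited: list of (species_id, name, database_identifier) tuples."""
    sbml = ET.Element(f"{{{SBML_NS}}}sbml", {"level": "2", "version": "1"})
    model = ET.SubElement(sbml, f"{{{SBML_NS}}}model", {"id": model_id})
    # Every species must name a compartment; one default compartment is assumed.
    compartments = ET.SubElement(model, f"{{{SBML_NS}}}listOfCompartments")
    ET.SubElement(compartments, f"{{{SBML_NS}}}compartment", {"id": "cell"})
    species_list = ET.SubElement(model, f"{{{SBML_NS}}}listOfSpecies")
    for species_id, name, db_id in visited:
        species = ET.SubElement(
            species_list, f"{{{SBML_NS}}}species",
            {"id": species_id, "name": name, "compartment": "cell"})
        # Record the database identifier of the entity as an annotation.
        annotation = ET.SubElement(species, f"{{{SBML_NS}}}annotation")
        ET.SubElement(annotation, f"{{{KB_NS}}}identifier").text = db_id
    return ET.tostring(sbml, encoding="unicode")
```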

SOAP RPC Services

This interface serves as a bridge between applications and the data warehouse. SOAP RPC means XML-packaged remote procedure calls over the HTTP protocol. Currently the Java editing tool can communicate with the SOAP RPC interface and retrieve entity-specific information from the database.

The current interface can process the following RPC calls:

Procedure name     Parameters   Return value
getNames           entity_id    name(s) of the entity
getType            entity_id    type of the entity
getAllProperties   entity_id    all properties and their values
getRelations       entity_id    type of each relationship and the other participating entity
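A request for one of these procedures might be packaged roughly as follows. The procedure name and the entity_id parameter come from the table; the service namespace, the parameter encoding, and the example entity ID are assumptions, since the service's WSDL is not given here. The sketch only builds the SOAP 1.1 envelope and does not send it.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:knowledgebase"                # hypothetical service namespace


def build_request(procedure, entity_id):
    """Return a SOAP 1.1 envelope calling `procedure` with one entity_id parameter."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{procedure}")
    ET.SubElement(call, "entity_id").text = str(entity_id)
    return ET.tostring(envelope, encoding="unicode")


request_xml = build_request("getNames", 12345)  # 12345 is a made-up entity ID
```

The resulting string would be sent as the body of an HTTP POST to the SOAP RPC endpoint, and the response envelope parsed in the same way.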

Availability

The web services interface is available at

http://balsa.e-cell.org:8888/webservices/sqlreq

The SOAP RPC service is available at

http://balsa.e-cell.org:8888/webservices/soapreq

Contact: gabor@sfc.keio.ac.jp