Molecular Interaction Knowledgebase for the E-Cell 3 project
Project name: E-Cell 3 project
Motivation
Many modeling problems in cell simulation require access to information provided by various public biological databases. Providing unified access to these databases, as well as integrating them with model building software, presents a standing challenge for designers of modeling systems.
Objectives
The ultimate objective of a modeling environment is to facilitate the research cycle by eliminating the bottleneck of drawing up biochemical networks.
Databases
Of great importance to the modeling community are databases that contain molecular interactions and background information about important molecular biology entities such as genes and proteins. The databases of primary interest are therefore those that hold:
The following databases were thus integrated:
Methods
Design principles
Storing the data
The data model defines how knowledge is represented in our data warehouse. Knowledge is represented as an attributed undirected graph in which nodes are entities and edges are relationships.
Logical data model
The four major modeling objects are entities, attributes, relationships and rules. An entity is a gene, protein, reaction or any other kind of biomolecular phenomenon that databases or models can contain. Entities can belong to different classes. Entities belonging to the same class can be merged without loss of information if their globally unique identifiers match.
The physical data model:
Getting and parsing the data
The files are downloaded from the database servers via FTP. The file formats are usually flat file or XML. The files are parsed into preprocessing tables, each of which represents a different type of entity. During parsing, multivalue attributes are normalized and placed in separate rows.
During load, the preprocessed data is transferred from the “temp” database to the “data” database. Attributes are semantically labeled, that is, every attribute is assigned an attribute code by means of a mapping table. Globally unique identifiers of entities and attribute rows are also assigned in the load phase. Semantic labeling is driven by rules that specify which label a given attribute in a given database should receive. The most important principle of semantic labeling is to find common identifiers and names in different databases and label them with the same attribute ID for the purpose of cross-referencing. The most frequent common identifiers are EC number, GI ID, Gene ID, UniProt ID and various KEGG IDs. There are 933 labeling rules for 33 preprocessing tables.
Integrating the data
Integrating databases means merging entities of the same type from different databases (or from the same database). Because attributes are made uniform during semantic labeling, merging happens without regard to the original source of the data. Integration happens in two steps: merging of entities and establishing relationships.
Merging of entities is performed by rules stored in the merge_rules table. If the values of a certain type of identifier match across different instances of a certain type of entity, those instances are merged. Merging means overwriting the entity ID on the attributes of the other matching entities with the surviving entity ID. The surviving entity ID is chosen randomly from the matching ones.
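The labeling and merging steps described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the rule contents, the source and field names, the attribute codes and the row layout are all hypothetical, and the survivor is chosen as the smallest entity ID (rather than randomly, as the text states) only so that the result is reproducible.

```python
from collections import defaultdict

# Hypothetical labeling rules: (source database, raw field name) -> attribute code.
LABEL_RULES = {
    ("source_a", "accession"): "UNIPROT_ID",
    ("source_b", "uniprot_xref"): "UNIPROT_ID",
    ("source_a", "recommended_name"): "NAME",
    ("source_b", "definition"): "NAME",
}

# Hypothetical merge rules: (entity type, attribute code) pairs whose
# matching values cause entities to be merged.
MERGE_RULES = {("protein", "UNIPROT_ID")}


def label(rows):
    """Map (source, raw field) to a shared attribute code.

    Rows without a labeling rule are skipped (an assumption of this sketch).
    """
    labeled = []
    for entity_id, entity_type, source, field, value in rows:
        code = LABEL_RULES.get((source, field))
        if code is not None:
            labeled.append((entity_id, entity_type, code, value))
    return labeled


def merge(labeled):
    """Overwrite entity IDs so entities sharing a rule-listed identifier become one."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Group entity IDs by (type, attribute code, value) for rule-listed attributes.
    groups = defaultdict(list)
    for entity_id, entity_type, code, value in labeled:
        find(entity_id)
        if (entity_type, code) in MERGE_RULES:
            groups[(entity_type, code, value)].append(entity_id)

    # Entities in the same group are merged; the smallest ID survives.
    for ids in groups.values():
        for other in ids[1:]:
            a, b = find(ids[0]), find(other)
            if a != b:
                parent[max(a, b)] = min(a, b)

    # Rewrite every attribute row to carry the surviving entity ID.
    return [(find(e), t, c, v) for e, t, c, v in labeled]
```

Calling `merge(label(rows))` on rows from two sources that share an identifier value returns attribute rows that all carry one surviving entity ID, regardless of which source they came from.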
Relationships (such as “type of”, “part of”, “participates” and “specialization of”) are determined by another rule table, relationship_rules. A relationship between two entities can be set up if they belong to certain types and the value of attribute1 of entity1 matches the value of attribute2 of entity2. This problem is thus very similar to merging by attributes and is implemented in a similar fashion. So far, 45 relationship rules have been introduced into the data warehouse.
Distributing the data
Integrated data should be made available to users in many different forms. Exposing the underlying database tables to users directly is not advisable, so a communication layer was established. HTTP was chosen as the protocol for communication between the data warehouse and the client side because it is universal and poses negligible security challenges. XML was chosen as the medium of communication because it is widely used for data exchange and because both XHTML and SOAP RPC are built on XML.
Client side tools
The client side tools comprise a web service designed with a minimalist approach to provide core access to the database contents. An MVC modeling tool that communicates with the data warehouse server and the web interface is being developed.
Implementation
Schematic diagram of implementation and used technologies
Component | Status |
---|---|
Integrated data warehouse | up and running |
Web services | up and running |
SOAP services | up and running |
Java editor tool | under development |
Model Editor database support | only server side implemented |
Data content and storage: the database contains 13,383,335 unique entities, 198,923,486 attributes and 8,770,663 relations between entities. The database occupies around 12.3 GB of disk space without indexes, which consume another 14.5 GB.
The web services facilitate easy search and browsing of the database. The user can perform a keyword search through a Google-like interface. The database engine uses fulltext indexes to perform a very fast search over all attribute values (texts, names, identifiers, metadata) and returns the results in groups of ten entities.
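The paged keyword search can be sketched roughly as follows. This is an illustration under stated assumptions, not the warehouse's query: the `attribute` table and its columns are hypothetical, SQLite stands in for the actual database engine, and a plain `LIKE` substring match replaces the fulltext index mentioned in the text.

```python
import sqlite3

PAGE_SIZE = 10  # the text states that hits are returned in groups of ten


def make_demo_db():
    """Build a small in-memory table of attribute values (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE attribute (entity_id INTEGER, value TEXT)")
    rows = [(i, f"kinase variant {i}") for i in range(1, 26)]
    rows.append((100, "glucose transporter"))
    conn.executemany("INSERT INTO attribute VALUES (?, ?)", rows)
    return conn


def search(conn, keyword, page=0):
    """Return one page of distinct entity IDs whose attribute values contain the keyword."""
    cursor = conn.execute(
        "SELECT DISTINCT entity_id FROM attribute "
        "WHERE value LIKE ? ORDER BY entity_id LIMIT ? OFFSET ?",
        (f"%{keyword}%", PAGE_SIZE, page * PAGE_SIZE),
    )
    return [row[0] for row in cursor]
```

With the demo data, `search(conn, "kinase")` yields the first ten matching entities and `search(conn, "kinase", page=2)` the remaining five.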
Clicking on one of the hits presents the detailed results. The detailed results contain:
To make browsing the biomolecular network more user-friendly, a Java editing tool is being developed. The Java editing tool is a simplified SBML editor that performs the following tasks:
The SOAP RPC interface serves as a bridge between applications and the data warehouse. SOAP RPC means remote procedure calls packaged in XML and sent over HTTP. Currently the Java editing tool can communicate with the SOAP RPC interface and retrieve entity-specific information from the database.
The current interface can process the following RPC calls:
Procedure name | Parameters | Return value |
---|---|---|
getNames | entity_id | name(s) of entity |
getType | entity_id | type of the entity |
getAllProperties | entity_id | all properties and their values |
getRelations | entity_id | the type of each relationship and the other participating entity |
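As a rough illustration of how a client might issue one of these calls, the Python sketch below wraps a procedure name and an entity_id in a SOAP 1.1 envelope and extracts the values from a response. Only the procedure names and the entity_id parameter come from the table above; the endpoint URL, the service namespace, the SOAPAction value and the shape of the response are assumptions made for the example.

```python
import urllib.request
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # standard SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:knowledgebase"  # hypothetical service namespace
ENDPOINT = "http://localhost:8080/soap"  # hypothetical endpoint URL


def build_request(procedure, entity_id):
    """Wrap one RPC call (procedure name plus entity_id) in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call_element = ET.SubElement(body, f"{{{SERVICE_NS}}}{procedure}")
    ET.SubElement(call_element, "entity_id").text = str(entity_id)
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)


def parse_values(response_xml):
    """Collect the text of every leaf element inside the SOAP Body."""
    root = ET.fromstring(response_xml)
    body = root.find(f"{{{SOAP_ENV}}}Body")
    if body is None:
        raise ValueError("response has no SOAP Body")
    return [
        element.text.strip()
        for element in body.iter()
        if len(element) == 0 and element.text and element.text.strip()
    ]


def call(procedure, entity_id, endpoint=ENDPOINT):
    """POST the envelope over HTTP and return the values found in the reply."""
    request = urllib.request.Request(
        endpoint,
        data=build_request(procedure, entity_id),
        headers={
            "Content-Type": "text/xml; charset=utf-8",
            "SOAPAction": f'"{SERVICE_NS}#{procedure}"',  # assumed action format
        },
    )
    with urllib.request.urlopen(request) as response:
        return parse_values(response.read())
```

Under these assumptions, `call("getNames", entity_id)` would return the names of an entity as a list of strings; the other three procedures take the same parameter and could be called the same way.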