Using The Web Services Model To Solve Data Integration Problems In BioInformatics
Brian Gilman
Whitehead Institute

The Vision
Users access data through a common look and feel (Ensembl look and feel)
Never write a FASTA parser again
Biological data is transferred over the web via a standard set of protocols (DAS/SOAP)
Data is aggregated via a common middleware and relationships amongst data are auto-discovered (Semantic Web)

The Problem

The Solution
Provide an infrastructure which is portable, extensible and robust for this domain
Use portable solutions from other domains to solve problems in BioInformatics
HTTP
XML
SOAP
UDDI
DAS

What Are Web Services?

OmniGene: Our Proposed Solution

OmniGene Provides A Web Services Based Middleware to Access Biological Data
Data Stores include:
Genomics
Ensembl
NCBI
Proteomics
Swissprot
Publications
Pubmed

How Do We Allow Universal Access To Disparate Data Sources?
Provide middleware to perform web service integration and translation
Data accessed through common protocol stack (“Soapy DAS”)
Data is referenced by translation to common naming scheme (Ontology)
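
The name-translation step can be pictured as a lookup from (data source, local name) to a shared term. This is a minimal sketch, not OmniGene code: the class NameTranslator and every table entry are hypothetical, and a real deployment would presumably load the mapping from an ontology rather than hard-code it.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical translator: maps (source, local name) pairs onto a shared vocabulary.
final class NameTranslator {
    // Illustrative entries only; these are not taken from any real ontology.
    private static final Map<String, String> COMMON_NAMES = Map.of(
        "Ensembl:exon", "exon",
        "NCBI:CDS", "coding_sequence",
        "Swissprot:CHAIN", "polypeptide_chain");

    // Returns the shared term, or an empty Optional when no mapping is known.
    static Optional<String> toCommonName(String source, String localName) {
        return Optional.ofNullable(COMMON_NAMES.get(source + ":" + localName));
    }
}
```

Returning an Optional forces callers to decide what to do with names that have no mapping yet.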

The Web Services Model

OmniGene Web Service Model

OmniGene Implements the Web Services Model By Using XML and Enterprise JavaBeans
A database schema can be represented as an XML schema
XML is a W3C standard
XML is supported industry wide
EJB is scalable and robust
<?xml version="1.0"?>
<schema source="WIBR">
  <table name="SNP">
    <field name="allele" type="VARCHAR"/>
    <field name="left_flank" type="TEXT"/>
  </table>
</schema>
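
A schema document of this shape can be read back with the XML parser that ships with the JDK. The sketch below assumes the element and attribute names shown above (table, field, name, type); the class SchemaReader is a hypothetical helper, not part of OmniGene.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads field definitions from a schema document.
final class SchemaReader {
    // Returns field name -> declared type for the first <table> element, in document order.
    static Map<String, String> readFields(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element table = (Element) doc.getElementsByTagName("table").item(0);
            NodeList fields = table.getElementsByTagName("field");
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                result.put(field.getAttribute("name"), field.getAttribute("type"));
            }
            return result;
        } catch (Exception e) {
            // Malformed XML or a missing <table> element both end up here.
            throw new IllegalArgumentException("could not parse schema document", e);
        }
    }
}
```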

Java Objects Can Be Dynamically Created From XML Documents
The XML-to-object mapping paradigm is well understood
XML is easily translated into other object models
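import java_classes.*;  // placeholder package name
import ejb_classes.*;   // placeholder package name

// Bean for the SNP table of the WIBR schema: one property per <field>.
public class SNP_WIBR extends EJB_Classes implements Table {
    private String left_flank, allele;

    public SNP_WIBR() {}

    public String getLeft_flank() { return left_flank; }
    public void setLeft_flank(String left_flank) { this.left_flank = left_flank; }
    public String getAllele() { return allele; }
    public void setAllele(String allele) { this.allele = allele; }
}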
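
One possible route from a field list to a class like the one above is plain source generation. This is a sketch under stated assumptions: BeanSourceGenerator is a hypothetical name, every field is treated as a String, and the EJB superclass and Table interface are left out. It is not a description of how OmniGene builds its beans.

```java
// Hypothetical generator: emits Java source for a bean with one String property per field.
final class BeanSourceGenerator {
    static String generate(String className, Iterable<String> fieldNames) {
        StringBuilder src = new StringBuilder("public class " + className + " {\n");
        for (String f : fieldNames) {
            // Capitalize the first letter for the accessor names, e.g. left_flank -> Left_flank.
            String cap = Character.toUpperCase(f.charAt(0)) + f.substring(1);
            src.append("    private String ").append(f).append(";\n");
            src.append("    public String get").append(cap)
               .append("() { return ").append(f).append("; }\n");
            src.append("    public void set").append(cap)
               .append("(String v) { this.").append(f).append(" = v; }\n");
        }
        return src.append("}\n").toString();
    }
}
```

Calling generate("SNP_WIBR", java.util.List.of("left_flank", "allele")) yields source text that can then be compiled or inspected.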

What Is The Distributed Annotation System (DAS)?
A protocol which utilizes HTTP and XML to query genomic data
Genomic features
Sequence Data
Proteomic Data (WIBR Initiative)
Publication Data (WIBR Initiative)
Workflow Data (WIBR Initiative)
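
A DAS query is an ordinary HTTP GET. The sketch below builds, but does not send, a request for the features on one segment, using the DAS 1 URL layout of data source, command, and a segment=REF:START,STOP argument. The class name, the example.org host, and the data source and reference names in the usage note are placeholders.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: builds the GET request for a DAS "features" query.
final class DasFeatureRequest {
    // dasBase is the server URL up to and including "/das".
    static HttpRequest build(String dasBase, String dataSource, String ref, int start, int stop) {
        String url = dasBase + "/" + dataSource
            + "/features?segment=" + ref + ":" + start + "," + stop;
        return HttpRequest.newBuilder(URI.create(url)).GET().build();
    }
}
```

For example, build("http://example.org/das", "elegans", "CHROMOSOME_I", 1000, 2000) produces a request that java.net.http.HttpClient can send; the response body is an XML document.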

Distributed Annotation System

DAS Requirements
No dependency on particular database schemas or technologies
No dependency on particular client-side technologies
Uncoupled reference and annotation servers
Must handle instability in genome assemblies
Must be dirt simple to implement

What is an Annotation?
Anything that has genomic coordinates
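
Taken literally, this definition needs very little data: a reference sequence plus a range on it. A minimal sketch of such a value type follows; the class name, the label field, and the 1-based inclusive coordinate convention are assumptions made for illustration.

```java
// Hypothetical value type: an annotation is a reference sequence plus a range on it.
final class GenomicAnnotation {
    final String reference; // ID of the sequence the coordinates refer to
    final int start;        // 1-based, inclusive (assumed convention)
    final int stop;         // inclusive
    final String label;     // free-form description of what is annotated

    GenomicAnnotation(String reference, int start, int stop, String label) {
        if (start < 1 || stop < start) {
            throw new IllegalArgumentException("invalid range " + start + ".." + stop);
        }
        this.reference = reference;
        this.start = start;
        this.stop = stop;
        this.label = label;
    }

    // True when this annotation shares at least one base with the queried range.
    boolean overlaps(String queryReference, int queryStart, int queryStop) {
        return reference.equals(queryReference) && start <= queryStop && queryStart <= stop;
    }
}
```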

Many Different Coordinates

Step (1): Fetch Data Sources

Step (2): Return Data Sources

Step (3): Retrieve Map

Step (4): Retrieve Annotation Servers (optional)

Step (5): Request Annotations

Step (6): Retrieve Annotations

Step (7): Request Stylesheets (optional)

Step (8): Integrate and Render
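
Most of the steps above correspond to one HTTP request each. The sketch below lists the URLs a client might issue, using the DAS 1 commands dsn, entry_points, features and stylesheet. The mapping of commands to step numbers is an assumption (in particular that "Retrieve Map" means entry_points), step 4 is not covered, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical planner: lists the request URLs for one pass through the steps.
final class DasSession {
    // dasBase is the server URL up to and including "/das"; segment looks like "REF:START,STOP".
    static List<String> requestPlan(String dasBase, String dataSource, String segment) {
        List<String> urls = new ArrayList<>();
        urls.add(dasBase + "/dsn");                                            // steps 1-2: list data sources
        urls.add(dasBase + "/" + dataSource + "/entry_points");                // step 3: assumed to be the map
        urls.add(dasBase + "/" + dataSource + "/features?segment=" + segment); // steps 5-6: annotations
        urls.add(dasBase + "/" + dataSource + "/stylesheet");                  // step 7: optional rendering hints
        return urls;
    }
}
```

Step 8 happens entirely on the client, which merges the responses from the reference server and from each annotation server before drawing them.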

Technology
Client/Server model
Communications via XML
Servers run on top of conventional web servers
Clients use Open Source XML parsers
Servers: >100 lines of code
Clients: >1000 lines of code

An Annotation Record

As Transmitted
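
On the wire, an annotation record is an XML element. The sketch below reads records from a simplified document using the FEATURE, TYPE, START and END element names of the DAS 1 features response. The real format carries more elements and attributes, so the layout here should be checked against the specification; the class name is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader: turns each FEATURE element into a one-line summary.
final class DasGffReader {
    static List<String> featureSummaries(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList features = doc.getElementsByTagName("FEATURE");
            List<String> out = new ArrayList<>();
            for (int i = 0; i < features.getLength(); i++) {
                Element f = (Element) features.item(i);
                out.add(f.getAttribute("id") + " " + text(f, "TYPE") + " "
                    + text(f, "START") + ".." + text(f, "END"));
            }
            return out;
        } catch (Exception e) {
            // Malformed XML or a FEATURE missing a required child both end up here.
            throw new IllegalArgumentException("could not parse features document", e);
        }
    }

    // Trimmed text of the first child element with the given tag name.
    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }
}
```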

Annotation Filtering
Semi-controlled feature vocabulary
Category
Transcription, translation, structural, experimental
Type
intron, exon, CDS, 5’UTR, SNP, similarity, oligo, insertion, RNAi
User can filter by category and/or type
Data sources can add new types at will
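
Filtering by category and/or type amounts to two optional predicates. A minimal sketch follows, with hypothetical class names; the assignment of types to categories in the usage note is illustrative only.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical filter over features that carry a category and a type.
final class FeatureFilter {
    static final class Feature {
        final String id, category, type;

        Feature(String id, String category, String type) {
            this.id = id;
            this.category = category;
            this.type = type;
        }
    }

    // A null category or type means "do not filter on this field".
    static List<Feature> filter(List<Feature> features, String category, String type) {
        return features.stream()
            .filter(f -> category == null || f.category.equals(category))
            .filter(f -> type == null || f.type.equals(type))
            .collect(Collectors.toList());
    }
}
```

For instance, filter(all, "transcription", null) keeps every feature in the transcription category regardless of type. Because types are plain strings, a data source can introduce a new one without any change to the filter.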

Versioning Issues
Annotate to smallest stable sequence element
finished clone
phase II fragment
Version everything
Annotations, contigs, assemblies
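
Annotating against a stable element means an annotation's own coordinates never change; only the placement of that element in the assembly does. The sketch below shows the idea with one mapper per assembly version. The class name and clone ID are hypothetical, and it assumes 1-based coordinates and clones placed in forward orientation.

```java
import java.util.Map;

// Hypothetical mapper: converts positions on a clone into positions on one assembly version.
final class AssemblyMapper {
    final String assemblyVersion;
    // clone ID -> 1-based start of that clone within this assembly version
    private final Map<String, Integer> cloneStarts;

    AssemblyMapper(String assemblyVersion, Map<String, Integer> cloneStarts) {
        this.assemblyVersion = assemblyVersion;
        this.cloneStarts = cloneStarts;
    }

    // Converts a 1-based position within a clone into an assembly position.
    int toAssembly(String cloneId, int positionInClone) {
        Integer cloneStart = cloneStarts.get(cloneId);
        if (cloneStart == null) {
            throw new IllegalArgumentException(
                "clone " + cloneId + " is not placed in assembly " + assemblyVersion);
        }
        return cloneStart + positionInClone - 1;
    }
}
```

When a new assembly moves the clone, a new mapper is built with the new offset and the stored annotation, still expressed in clone coordinates, is reused unchanged.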

Assembly Version Changes

Software
Libraries
Bio::DAS (Perl)
Dazzle (Java)
DASQuery (WICGR API)
Servers & Databases
Acedb, Dazzle-on-Ensembl, Gadfly, Bio::DB::GFF
OmniGene
Clients
Java-Client, Geodesic (Java), DasView (Perl), Ensembl Contigview (Perl), OmniView (Java)

DAS Implementations
Reference servers
WormBase (C. elegans)
FlyBase (Drosophila)
Ensembl (Human)
HGxxx (UCSC)
Annotation servers
WormBase (C. elegans)
WashU (C. elegans)
Ensembl (Human)
FlyBase (Drosophila)
TIGR (Human, C. elegans)
MRC (Human, C. elegans)
LBL (Human)

Limitations with DAS version 1
Difficult to represent nested subfeatures
Can’t annotate non-genomic references
Too narrowly focussed on genomic data
Read only protocol

Open Source Distribution
Software & Specifications
http://www.biodas.org
http://www.biojava.org
http://www.bioxml.org
http://www.sourceforge.net/projects/omnigene

For More Information
www.sourceforge.net/projects/omnigene
devo.wi.mit.edu/~gilmanb/omnigene
www.uddi.org
www.w3c.org
xml.apache.org
www.biodas.org

WormBase 1

WormBase 2

WormBase 3

EnsEMBL 1

EnsEMBL 2