Using The Web Services Model To Solve Data Integration Problems In BioInformatics
Brian Gilman
Whitehead Institute
The Vision
Users access data through a common look and feel (Ensembl look and feel)
Never write a FASTA parser again
Biological data is transferred over the web via a standard set of protocols (DAS/SOAP)
Data is aggregated via a common middleware and relationships amongst data are auto-discovered (Semantic Web)
The Problem
The Solution
Provide an infrastructure which is portable, extensible and robust for this domain
Use portable solutions from other domains to solve problems in BioInformatics
HTTP
XML
SOAP
UDDI
DAS
What Are Web Services?
OmniGene: Our Proposed Solution
OmniGene Provides A Web Services Based Middleware to Access Biological Data
Data Stores include:
Genomics
Ensembl
NCBI
Proteomics
Swissprot
Publications
Pubmed
How Do We Allow Universal Access To Disparate Data Sources?
Provide middleware to perform web service integration and translation
Data accessed through common protocol stack (“Soapy DAS”)
Data is referenced by translation to common naming scheme (Ontology)
The Web Services Model
OmniGene Web Service Model
OmniGene Implements the Web Services Model By Using XML and Enterprise Java Beans
A database schema can be represented as XML Schema
XML is a W3C standard
XML is supported industry wide
EJB is scalable and robust
<?xml version="1.0"?>
<schema source="WIBR">
  <table name="SNP">
    <field name="allele" type="VARCHAR"/>
    <field name="left_flank" type="TEXT"/>
  </table>
</schema>
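A minimal sketch of how such a schema document could be read back into table and field definitions, using only the DOM parser that ships with the JDK. The class name SchemaReader and the method readTables are hypothetical and are not part of OmniGene; the element and attribute names are the ones used in the schema example above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not OmniGene code.
class SchemaReader {
    // Returns table name -> (field name -> declared type), in document order.
    static Map<String, Map<String, String>> readTables(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, Map<String, String>> tables = new LinkedHashMap<>();
            NodeList tableNodes = doc.getElementsByTagName("table");
            for (int i = 0; i < tableNodes.getLength(); i++) {
                Element table = (Element) tableNodes.item(i);
                Map<String, String> fields = new LinkedHashMap<>();
                NodeList fieldNodes = table.getElementsByTagName("field");
                for (int j = 0; j < fieldNodes.getLength(); j++) {
                    Element field = (Element) fieldNodes.item(j);
                    fields.put(field.getAttribute("name"), field.getAttribute("type"));
                }
                tables.put(table.getAttribute("name"), fields);
            }
            return tables;
        } catch (Exception e) {
            throw new IllegalArgumentException("not a readable schema document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
                + "<schema source=\"WIBR\">"
                + "<table name=\"SNP\">"
                + "<field name=\"allele\" type=\"VARCHAR\"/>"
                + "<field name=\"left_flank\" type=\"TEXT\"/>"
                + "</table></schema>";
        // prints {SNP={allele=VARCHAR, left_flank=TEXT}}
        System.out.println(readTables(xml));
    }
}
```

Keeping the result as nested maps means a new table or field in the XML needs no code change on the reading side.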
Java Objects Can Be Dynamically Created From XML Documents
XML-->Object Paradigm is well understood
XML is easily translated into other object models
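import java_classes.*;
import ejb_classes.*;

// Bean for the SNP table of the WIBR schema
public class SNP_WIBR extends EJB_Classes implements Table {
    private String left_flank, allele;

    public SNP_WIBR() {}

    public String getLeft_flank() { return left_flank; }
    public void setLeft_flank(String left_flank) { this.left_flank = left_flank; }
    public String getAllele() { return allele; }
    public void setAllele(String allele) { this.allele = allele; }
}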
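One way the XML-to-object step could work is sketched below: attribute names in a record are matched to bean setters by reflection. This is an assumption about the approach, not OmniGene's actual mechanism. The record layout (one element with one attribute per field), the class XmlToBean and the stand-in bean Snp are all hypothetical; Snp mirrors the accessors of the bean above but leaves out the EJB superclass so the sketch compiles on its own.

```java
import java.io.ByteArrayInputStream;
import java.lang.reflect.Method;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;

// Hypothetical helper, not OmniGene code.
class XmlToBean {
    // Stand-in for the SNP bean; EJB plumbing omitted.
    static class Snp {
        private String left_flank, allele;
        public String getLeft_flank() { return left_flank; }
        public void setLeft_flank(String v) { left_flank = v; }
        public String getAllele() { return allele; }
        public void setAllele(String v) { allele = v; }
    }

    // Creates a bean and calls setXxx(String) for every attribute xxx on the root element.
    static <T> T fromXml(String xml, Class<T> type) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            T bean = type.getDeclaredConstructor().newInstance();
            NamedNodeMap attrs = root.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                String name = attrs.item(i).getNodeName();
                String setter = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
                Method m = type.getMethod(setter, String.class);
                m.invoke(bean, attrs.item(i).getNodeValue());
            }
            return bean;
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot map XML onto " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        // Assumed record layout: one attribute per table field.
        Snp snp = fromXml("<SNP allele=\"A/G\" left_flank=\"ACGT\"/>", Snp.class);
        System.out.println(snp.getAllele() + " " + snp.getLeft_flank());
    }
}
```

An attribute with no matching setter is reported as an error rather than silently dropped, so schema and bean cannot drift apart unnoticed.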
What Is The Distributed Annotation System (DAS)?
A protocol which utilizes HTTP and XML to query genomic data
Genomic features
Sequence Data
Proteomic Data (WIBR Initiative)
Publication Data (WIBR Initiative)
Workflow Data (WIBR Initiative)
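Because a DAS query is an ordinary HTTP GET, a request can be written as a URL. The sketch below builds a features request for one segment of a reference sequence. The path layout (/das/, then the data source name, then features?segment=REF:start,stop) follows the DAS 1 specification as published at biodas.org and should be checked against it. The server address, data source name and the class DasUrls are hypothetical.

```java
// Hypothetical helper, not part of any DAS library.
class DasUrls {
    // Builds a DAS 1 style features request for REF:start,stop on one data source.
    static String featuresUrl(String server, String dataSource, String ref, int start, int stop) {
        if (start < 1 || stop < start) {
            throw new IllegalArgumentException("expected 1 <= start <= stop");
        }
        // Tolerate a trailing slash on the server address.
        String base = server.endsWith("/") ? server.substring(0, server.length() - 1) : server;
        return base + "/das/" + dataSource + "/features?segment=" + ref + ":" + start + "," + stop;
    }

    public static void main(String[] args) {
        // das.example.org and "elegans" are placeholders, not real endpoints.
        // prints http://das.example.org/das/elegans/features?segment=CHROMOSOME_I:1000,2000
        System.out.println(featuresUrl("http://das.example.org", "elegans", "CHROMOSOME_I", 1000, 2000));
    }
}
```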
Distributed Annotation System
DAS Requirements
No dependency on particular database schemas or technologies
No dependency on particular client-side technologies
Uncoupled reference and annotation servers
Must handle instability in genome assemblies
Must be dirt simple to implement
What is an Annotation?
Anything that has genomic coordinates
Many Different Coordinates
Step (1): Fetch Data Sources
Step (2): Return Data Sources
Step (3): Retrieve Map
Step (4): Retrieve Annotation Servers (optional)
Step (5): Request Annotations
Step (6): Retrieve Annotations
Step (7): Request Stylesheets (optional)
Step (8): Integrate and Render
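The integrate part of step (8) can be sketched as a merge of the annotation lists returned by several servers for the same reference segment. This is an illustration under assumptions, not the behaviour of any particular DAS client: the class AnnotationMerger, its Annotation type and the sample coordinates are all hypothetical, and rendering is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of any DAS client.
class AnnotationMerger {
    static final class Annotation {
        final String source, type;
        final int start, stop;
        Annotation(String source, String type, int start, int stop) {
            this.source = source; this.type = type; this.start = start; this.stop = stop;
        }
        @Override public String toString() {
            return type + "@" + start + "-" + stop + " (" + source + ")";
        }
    }

    // Keeps annotations that overlap [viewStart, viewStop] and orders them by start, then stop.
    static List<Annotation> integrate(List<List<Annotation>> perServer, int viewStart, int viewStop) {
        List<Annotation> merged = new ArrayList<>();
        for (List<Annotation> fromOneServer : perServer) {
            for (Annotation a : fromOneServer) {
                if (a.stop >= viewStart && a.start <= viewStop) {
                    merged.add(a);
                }
            }
        }
        merged.sort(Comparator.comparingInt((Annotation a) -> a.start)
                .thenComparingInt(a -> a.stop));
        return merged;
    }

    public static void main(String[] args) {
        // Made-up coordinates on one shared reference segment.
        List<Annotation> serverA = Arrays.asList(
                new Annotation("server A", "exon", 500, 700),
                new Annotation("server A", "exon", 5000, 5100));
        List<Annotation> serverB = Arrays.asList(
                new Annotation("server B", "SNP", 120, 120));
        System.out.println(integrate(Arrays.asList(serverA, serverB), 1, 1000));
    }
}
```

The merge only works because every server reports coordinates against the same reference map retrieved in step (3).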
Technology
Client/Server model
Communications via XML
Servers run on top of conventional web servers
Clients use Open Source XML parsers
Servers: >100 lines of code
Clients: >1000 lines of code
An Annotation Record
As Transmitted
Annotation Filtering
Semi-controlled feature vocabulary
Category
Transcription, translation, structural, experimental
Type
intron, exon, CDS, 5’UTR, SNP, similarity, oligo, insertion, RNAi
User can filter by category and/or type
Data sources can add new types at will
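A minimal sketch of client-side filtering by category and/or type. Categories and types are kept as plain strings rather than an enum because data sources can add new types at will. The class FeatureFilter and its Feature type are hypothetical and not taken from any DAS library.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of any DAS library.
class FeatureFilter {
    static final class Feature {
        final String category, type;
        Feature(String category, String type) { this.category = category; this.type = type; }
        @Override public String toString() { return category + "/" + type; }
    }

    // An empty set means "no restriction" on that axis; both restrictions must hold.
    static List<Feature> filter(List<Feature> features, Set<String> categories, Set<String> types) {
        return features.stream()
                .filter(f -> categories.isEmpty() || categories.contains(f.category))
                .filter(f -> types.isEmpty() || types.contains(f.type))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The category assigned to each type here is illustrative only.
        List<Feature> all = Arrays.asList(
                new Feature("transcription", "exon"),
                new Feature("transcription", "intron"),
                new Feature("experimental", "RNAi"));
        Set<String> onlyTranscription = new HashSet<>(Arrays.asList("transcription"));
        Set<String> onlyExons = new HashSet<>(Arrays.asList("exon"));
        System.out.println(filter(all, onlyTranscription, Collections.emptySet()));
        System.out.println(filter(all, onlyTranscription, onlyExons));
    }
}
```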
Versioning Issues
Annotate to smallest stable sequence element
finished clone
phase II fragment
Version everything
Annotations, contigs, assemblies
Assembly Version Changes
Software
Libraries
Bio::DAS (Perl)
Dazzle (Java)
DASQuery (WICGR API)
Servers & Databases
Acedb, Dazzle-on-Ensembl, Gadfly, Bio::DB::GFF
OmniGene
Clients
Java-Client, Geodesic (Java), DasView (Perl), Ensembl Contigview (Perl), OmniView (Java)
DAS Implementations
Reference servers
WormBase (C. elegans)
FlyBase (Drosophila)
Ensembl (Human)
HGxxx (UCSC)
Annotation servers
WormBase (C. elegans)
WashU (C. elegans)
Ensembl (Human)
FlyBase (Drosophila)
TIGR (Human, C. elegans)
MRC (Human, C. elegans)
LBL (Human)
Limitations with DAS version 1
Difficult to represent nested subfeatures
Can’t annotate non-genomic references
Too narrowly focussed on genomic data
Read only protocol
Open Source Distribution
Software & Specifications
http://www.biodas.org
http://www.biojava.org
http://www.bioxml.org
http://www.sourceforge.net/projects/omnigene
For More Information
www.sourceforge.net/projects/omnigene
devo.wi.mit.edu/~gilmanb/omnigene
www.uddi.org
www.w3c.org
xml.apache.org
www.biodas.org
Wormbase 1
Wormbase 2
Wormbase 3
EnsEMBL 1
EnsEMBL 2