Using The Web Services Model To Solve Data Integration Problems In BioInformatics
Brian Gilman
Whitehead Institute
The Vision
- Users access data through a common look and feel (Ensembl look and feel)
- Never write a FASTA parser ever again
- Biological data is transferred over the web via a standard set of protocols (DAS/SOAP)
- Data is aggregated via a common middleware and relationships amongst data are auto-discovered (Semantic Web)
The Problem
The Solution
- Provide an infrastructure which is portable, extensible and robust for this domain
- Use portable solutions from other domains to solve problems in BioInformatics
  - HTTP
  - XML
  - SOAP
  - UDDI
  - DAS
What Are Web Services?
OmniGene: Our Proposed Solution
OmniGene Provides A Web Services Based Middleware to Access Biological Data
- Data Stores include:
  - Genomics
    - Ensembl
    - NCBI
  - Proteomics
    - Swissprot
  - Publications
    - Pubmed
How Do We Allow Universal Access To Disparate Data Sources?
- Provide middleware to perform web service integration and translation
- Data accessed through common protocol stack (“Soapy DAS”)
- Data is referenced by translation to common naming scheme (Ontology)
The Web Services Model
OmniGene Web Service Model
OmniGene Implements the Web Services Model By Using XML and Enterprise Java Beans
- A database schema can be represented as XML Schema
- XML is a W3C standard
- XML is supported industry wide
- EJB is scalable and robust

    <?xml version="1.0"?>
    <schema source="WIBR">
      <table name="SNP">
        <field name="allele" type="VARCHAR"/>
        <field name="left_flank" type="TEXT"/>
      </table>
    </schema>
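Java Objects Can Be Dynamically Created From XML Documents
- The XML-to-object paradigm is well understood
- XML is easily translated into other object models

    // java_classes and ejb_classes are the slide's placeholder package names
    import java_classes.*;
    import ejb_classes.*;

    // Bean for the SNP table of the WIBR schema: one property per <field>
    public class SNP_WIBR extends EJB_Classes implements Table {
        private String left_flank, allele;

        public SNP_WIBR() {}

        public String getLeft_flank() { return left_flank; }
        public void setLeft_flank(String left_flank) { this.left_flank = left_flank; }

        public String getAllele() { return allele; }
        public void setAllele(String allele) { this.allele = allele; }
    }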
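
As a rough illustration of the XML-to-object step, the sketch below reads a schema document shaped like the WIBR example and emits Java source for one bean per table. It is not OmniGene's actual generator: the class name `SchemaToBean`, the `TABLE_SOURCE` naming rule inferred from `SNP_WIBR`, and the decision to map every column type to `String` are all assumptions made for this example. The EJB superclass and `Table` interface are left out so the output stays self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of OmniGene.
final class SchemaToBean {

    /** Capitalizes the first letter, e.g. "left_flank" becomes "Left_flank". */
    static String capitalize(String s) {
        return s.substring(0, 1).toUpperCase() + s.substring(1);
    }

    /** Returns Java source text with one bean class per table element. */
    static String generate(String schemaXml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(schemaXml)));
            Element schema = doc.getDocumentElement();
            String source = schema.getAttribute("source");
            StringBuilder out = new StringBuilder();
            NodeList tables = schema.getElementsByTagName("table");
            for (int t = 0; t < tables.getLength(); t++) {
                Element table = (Element) tables.item(t);
                // Assumed naming rule: table name, underscore, schema source.
                out.append("public class ").append(table.getAttribute("name"))
                   .append('_').append(source).append(" {\n");
                NodeList fields = table.getElementsByTagName("field");
                for (int f = 0; f < fields.getLength(); f++) {
                    String name = ((Element) fields.item(f)).getAttribute("name");
                    String cap = capitalize(name);
                    // Assumption: every column type (VARCHAR, TEXT) maps to String.
                    out.append("    private String ").append(name).append(";\n");
                    out.append("    public String get").append(cap)
                       .append("() { return ").append(name).append("; }\n");
                    out.append("    public void set").append(cap)
                       .append("(String ").append(name).append(") { this.")
                       .append(name).append(" = ").append(name).append("; }\n");
                }
                out.append("}\n");
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read schema XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
            + "<schema source=\"WIBR\"><table name=\"SNP\">"
            + "<field name=\"allele\" type=\"VARCHAR\"/>"
            + "<field name=\"left_flank\" type=\"TEXT\"/>"
            + "</table></schema>";
        System.out.println(generate(xml));
    }
}
```

Generating source text keeps the sketch short. A production system might instead compile the classes at runtime or use a data-binding library.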
What Is The Distributed Annotation System (DAS)?
- A protocol which utilizes HTTP and XML to query genomic data
  - Genomic features
  - Sequence Data
  - Proteomic Data (WIBR Initiative)
  - Publication Data (WIBR Initiative)
  - Workflow Data (WIBR Initiative)
Distributed Annotation System
DAS Requirements
- No dependency on particular database schemas or technologies
- No dependency on particular client-side technologies
- Uncoupled reference and annotation servers
- Must handle instability in genome assemblies
- Must be dirt simple to implement
What is an Annotation?
- Anything that has genomic coordinates
Many Different Coordinates
Step (1): Fetch Data Sources
Step (2): Return Data Sources
Step (3): Retrieve Map
Step (4): Retrieve Annotation Servers (optional)
Step (5): Request Annotations
Step (6): Retrieve Annotations
Step (7): Request Stylesheets (optional)
Step (8): Integrate and Render
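
The request steps above can be sketched as plain HTTP URLs. The example below builds the URLs a client might issue, assuming the DAS 1 conventions of a `/das/` path prefix and the `dsn`, `entry_points`, `features` and `stylesheet` commands. The class name `DasUrls`, the host `das.example.org` and the data source name `elegans` are hypothetical. Matching each command to a numbered step is this example's reading of the slides and is not stated in them.

```java
// Hypothetical helper that only assembles request URLs; it performs no network I/O.
final class DasUrls {
    private final String server;

    DasUrls(String server) {
        // Drop a trailing slash so paths can be appended uniformly.
        this.server = server.endsWith("/")
            ? server.substring(0, server.length() - 1)
            : server;
    }

    /** Step 1: ask the server which data sources it offers. */
    String dataSources() {
        return server + "/das/dsn";
    }

    /** Step 3 (assumed mapping): fetch the top-level map of a data source. */
    String entryPoints(String dsn) {
        return server + "/das/" + dsn + "/entry_points";
    }

    /** Step 5: request the annotations that overlap a region of a reference sequence. */
    String features(String dsn, String reference, long start, long stop) {
        return server + "/das/" + dsn + "/features?segment="
            + reference + ":" + start + "," + stop;
    }

    /** Step 7: request the optional stylesheet used for rendering. */
    String stylesheet(String dsn) {
        return server + "/das/" + dsn + "/stylesheet";
    }

    public static void main(String[] args) {
        DasUrls das = new DasUrls("http://das.example.org");
        System.out.println(das.dataSources());
        System.out.println(das.entryPoints("elegans"));
        System.out.println(das.features("elegans", "CHROMOSOME_I", 1000, 2000));
        System.out.println(das.stylesheet("elegans"));
    }
}
```

Because every step is an ordinary HTTP GET that returns XML, a client can be assembled from an HTTP library and an XML parser with no protocol-specific toolkit.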
Technology
- Client/Server model
- Communications via XML
- Servers run on top of conventional web servers
- Clients use Open Source XML parsers
- Servers: >100 lines of code
- Clients: >1000 lines of code
An Annotation Record
As Transmitted
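
The original slide showed the record as an image, which is not reproduced here. As a stand-in, the sketch below parses a small feature record of the general shape used by DAS 1 `features` responses (`DASGFF`, `SEGMENT`, `FEATURE`, `TYPE`, `START`, `END`). The sample values, the class name `DasFeatureReader` and the choice of fields to extract are assumptions for illustration, and real responses carry more elements than shown.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical reader for a simplified DAS feature document.
final class DasFeatureReader {

    /** One annotation: an identifier, a type/category pair, and coordinates. */
    static final class Feature {
        final String id, type, category;
        final int start, end;

        Feature(String id, String type, String category, int start, int end) {
            this.id = id;
            this.type = type;
            this.category = category;
            this.start = start;
            this.end = end;
        }
    }

    /** Trimmed text of the first descendant element with the given tag name. */
    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    /** Extracts every FEATURE element from the document. */
    static List<Feature> read(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            List<Feature> features = new ArrayList<Feature>();
            NodeList nodes = doc.getElementsByTagName("FEATURE");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element f = (Element) nodes.item(i);
                Element type = (Element) f.getElementsByTagName("TYPE").item(0);
                features.add(new Feature(
                    f.getAttribute("id"),
                    type.getAttribute("id"),
                    type.getAttribute("category"),
                    Integer.parseInt(text(f, "START")),
                    Integer.parseInt(text(f, "END"))));
            }
            return features;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read DAS feature XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample record; identifiers and coordinates are illustrative only.
        String xml = "<DASGFF><GFF version=\"1.0\">"
            + "<SEGMENT id=\"CHROMOSOME_I\" start=\"1000\" stop=\"2000\">"
            + "<FEATURE id=\"exon-1\" label=\"exon-1\">"
            + "<TYPE id=\"exon\" category=\"transcription\">exon</TYPE>"
            + "<START>1200</START><END>1350</END>"
            + "</FEATURE></SEGMENT></GFF></DASGFF>";
        for (Feature f : read(xml)) {
            System.out.println(f.id + " " + f.category + "/" + f.type
                + " " + f.start + ".." + f.end);
        }
    }
}
```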
Annotation Filtering
- Semi-controlled feature vocabulary
  - Category
    - Transcription, translation, structural, experimental
  - Type
    - intron, exon, CDS, 5’UTR, SNP, similarity, oligo, insertion, RNAi
- User can filter by category and/or type
- Data sources can add new types at will
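
Filtering by category and/or type can be sketched as a simple predicate over annotation records. In the example below, the `Annotation` class, the `filter` method and the pairing of types with categories (for instance `exon` under `transcription`) are assumptions for illustration. Passing `null` for either argument means that field is not filtered on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical client-side filter; a DAS server could equally filter before sending.
final class AnnotationFilter {

    /** Minimal annotation record carrying only the fields needed for filtering. */
    static final class Annotation {
        final String id, category, type;

        Annotation(String id, String category, String type) {
            this.id = id;
            this.category = category;
            this.type = type;
        }
    }

    /**
     * Keeps annotations that match the requested category and type.
     * A null category or type matches everything for that field.
     */
    static List<Annotation> filter(List<Annotation> input, String category, String type) {
        List<Annotation> kept = new ArrayList<Annotation>();
        for (Annotation a : input) {
            boolean categoryOk = category == null || category.equals(a.category);
            boolean typeOk = type == null || type.equals(a.type);
            if (categoryOk && typeOk) {
                kept.add(a);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Annotation> all = Arrays.asList(
            new Annotation("a1", "transcription", "exon"),
            new Annotation("a2", "transcription", "intron"),
            new Annotation("a3", "experimental", "SNP"));
        for (Annotation a : filter(all, "transcription", null)) {
            System.out.println(a.id + " " + a.type);
        }
    }
}
```

Since types are compared as plain strings, a data source can introduce a new type without any change to the filter.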
Versioning Issues
- Annotate to smallest stable sequence element
  - finished clone
  - phase II fragment
- Version everything
  - Annotations, contigs, assemblies
Assembly Version Changes
Software
- Libraries
  - Bio::DAS (Perl)
  - Dazzle (Java)
  - DASQuery (WICGR API)
- Servers & Databases
  - Acedb, Dazzle-on-Ensembl, Gadfly, Bio::DB::GFF
  - OmniGene
- Clients
  - Java-Client, Geodesic (Java), DasView (Perl), Ensembl Contigview (Perl), OmniView (Java)
DAS Implementations
- Reference servers
  - WormBase (C. elegans)
  - FlyBase (Drosophila)
  - Ensembl (Human)
  - HGxxx (UCSC)
- Annotation servers
  - WormBase (C. elegans)
  - WashU (elegans)
  - Ensembl (Human)
  - FlyBase (Drosophila)
  - TIGR (Human, elegans)
  - MRC (Human, elegans)
  - LBL (Human)
Limitations with DAS version 1
- Difficult to represent nested subfeatures
- Can’t annotate non-genomic references
- Too narrowly focussed on genomic data
- Read only protocol
Open Source Distribution
- Software & Specifications
  - http://www.biodas.org
  - http://www.biojava.org
  - http://www.bioxml.org
  - http://www.sourceforge.net/projects/omnigene
For More Information
- www.sourceforge.net/projects/omnigene
- devo.wi.mit.edu/~gilmanb/omnigene
- www.uddi.org
- www.w3c.org
- xml.apache.org
- www.biodas.org
Wormbase 1
Wormbase 2
Wormbase 3
EnsEMBL 1
EnsEMBL 2