Slides - bioCADDIE

DataBridge
http://databridge.web.unc.edu/
Arcot Rajasekar
rajasekar@unc.edu
The University of North Carolina at Chapel Hill
Data Bridge: A Social Network for Long Tail Science Data
Outline of the Talk
•  Motivation
•  Design
•  Implementation Status
•  Examples
•  Future
Big Data
•  Well-known problem
•  Characteristics:
–  Volume
•  Exponential Increase in Size & Count
–  Velocity
•  Speed of Generation & Consumption
–  Variety
•  Disparate Types of data
–  Veracity
•  Integrity & Fidelity
–  Value
•  Worth
Three Kinds of Big Data (1)
•  Archetypal Big Data
–  Science Projects – LHC, LSST, SCEC, OOI…
–  Business/Industry – Genomics, Finance, Pharma…
–  Government – NASA, NOAA, NCDC…
•  Volume – High – large datasets
•  Velocity – High but predictable
•  Variety – Low – Standardized, Metadata, Curated
•  Veracity – High Fidelity and Credible
•  Value – High – funded
•  Findability – High – Known sites and discovery mechanisms
•  Availability – High – Published API
Light & Visible Data
Three Kinds of Big Data (2)
•  Crowd-Sourced Big Data
–  Social Media – Facebook, Twitter…
–  Recommenders – Yelp, Angie’s List, Groupon
–  Web Commerce – Amazon, eBay, Orbitz, eNews
•  Volume – High – small data
•  Velocity – High and unpredictable
•  Variety – High – But well managed
•  Veracity – Mixed – Low to High
•  Value – Ephemeral – can be None to High
•  Findability – High – Known and Advertised
•  Availability – Immediate Interest – Web pages and Apps
Nova-like Data
Three Kinds of Big Data (3)
•  Long-tail Big Data
–  Science Projects – small teams and organizations
–  Personal – Hobbies, Amateur/Citizen Science/Arts
–  Government – Internal and unpublished
•  Volume – High – small data sets
•  Velocity – Low
•  Variety – High – Too many, Idiosyncratic
•  Veracity – Non Credible until proven
•  Value – unknown
•  Findability – None – Hidden and not advertised
•  Availability – None – On local disks and tapes
Dark Data
Our Interest: Long tail of Science Data
•  Large number of data generators
•  Highly distributed
•  NSF in 2011
–  11,150 awards
–  Median size $126K
–  Primarily single PI
•  Data individually not petascale but large in aggregate
•  Of possible Value
•  Dark Data
•  Unpublished
•  Used once and Forgotten
•  Sunset Data
–  Even NSF expects retention for only 3 years, at the least, after the project
•  Expired Data
–  curated but disposed
–  Social media data
Problems: Long-tail of Science Data
First Mile Problem
•  How to make it available?
•  Where do I upload?
•  Who is in charge?
•  How do I get credit?
•  Can I control access?
•  How do I pool with other like-minded researchers? Community services?
•  How much is long-term?
•  Who pays for it?
Last Mile Problem
•  How to make it findable?
•  What is needed to make it more visible? Metadata?
•  Are there other methods to make my data findable?
•  My data has specific ways & characteristics
–  How do I expose them as finding aids?
•  How can I find similar data?
Solving the long-tail problem will also help the other two Big Data problems
Dark Data from The Long Tail of Science
•  Long tail data amounts to small data sets produced by numerous investigators.
•  Dark Data Exemplars:
–  From Brahe to Mendel, discovery has come from relatively small data sets
•  Much long tail data is dark data, data “not easily found by potential users” (Bryan Heidorn)
•  Long tail data sets lack the structural advantages of “classic” Big Data, such as professional curation, homogeneous and well-documented data formats, and populated, well-formed metadata schemas.
•  Improving availability and findability will help solve the problems with this type of big data and make it more mainstream.
Expose the hidden nuggets
Data Bridge: A solution
•  We tackle mainly the last mile problem
–  But show an avenue for solving the first mile problem
•  Main Aim: Improve Findability
•  Solution: Empower Data!!
•  Empower data to “find” its own community
•  Community of likeness
•  Similarity in multiple dimensions
•  Look at data from diverse angles
•  Find
–  Relationships – strong links
–  Friendships – weaker links
•  Assist scientists in discovering “interesting” data sets by automatically forming communities of data
Data Bridge Strategy: Automatic Community Detection
•  Doc Watson albums at Amazon.com
•  Clustered by Yasiv.com
•  Clusters represent related items
•  Clusters are connected by some “internal metrics”
Data Bridge: Design
Construct multi-dimensional social networks for data.
Three challenges:
•  Evaluate multiple types of “metrics” on data
–  Domain-specific, genre-specific, project-specific
–  Use Socio-metric Network Algorithms
–  Similar to social networks – but for data
•  Find relevance
–  Slices of similarity
–  Explore Relationships between Data, Users, Resources, Methods, Workflows…
–  Use Relevance Algorithms
•  Create communities
–  Use Clustering Algorithms
•  Provide an extensible & big data framework
–  Democratize the process
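The three steps on this slide (metrics, relevance, communities) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from DataBridge: the keyword metadata, the Jaccard metric, the 0.5 threshold, and the use of connected components as a stand-in for a clustering algorithm.

```python
from itertools import combinations

# Hypothetical metadata: dataset id -> set of keywords (illustrative only).
DATASETS = {
    "ds1": {"gene", "sequence", "yeast"},
    "ds2": {"gene", "sequence", "mouse"},
    "ds3": {"survey", "election", "voters"},
    "ds4": {"survey", "election", "turnout"},
    "ds5": {"telescope", "spectra"},
}

def jaccard(a, b):
    """Metric step: shared keywords divided by all keywords (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_matrix(datasets):
    """Score every unordered pair of datasets with the metric."""
    return {
        (x, y): jaccard(datasets[x], datasets[y])
        for x, y in combinations(sorted(datasets), 2)
    }

def relevant_links(matrix, threshold):
    """Relevance step: keep only pairs scoring at or above the threshold."""
    return [pair for pair, score in matrix.items() if score >= threshold]

def communities(nodes, links):
    """Community step: connected components of the link graph."""
    neighbours = {n: set() for n in nodes}
    for x, y in links:
        neighbours[x].add(y)
        neighbours[y].add(x)
    seen, groups = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

matrix = similarity_matrix(DATASETS)
links = relevant_links(matrix, threshold=0.5)
for group in communities(DATASETS, links):
    print(group)
```

Swapping `jaccard` for a domain-specific metric, or `communities` for a different clustering routine, leaves the other two steps unchanged. That separation is the kind of extensibility the slide asks for.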
Data Bridge Infrastructure
•  Accommodate an extensible number of SNA (socio-metric network), RA (relevance) and CA (clustering) algorithms
•  Provision an easy way to “add” new algorithms
–  Crowd sourcing
–  MyVector (add your own way of defining metrics)
•  Provision an easy way to connect data to algorithms
–  Multiple ways of finding similarities
–  Multiple ways of providing search criteria
–  MyBridge (add your own way of finding relevance)
•  Provision an easy way to form communities
–  Multiple ways of categorizing data
–  MyCommunity (add your own domain-specific clustering)
•  Make it a distributed system
–  Grow and Shrink as needed
–  Make it easy for third-party setups
–  Federation of Data Bridges
Data Bridge Architecture
Message-oriented Architecture
•  Loosely linked processes: Messages make the connections
•  Scenario:
–  User A has a novel, “signature” detection algorithm for gene sequences
–  User A wraps algorithm with API provided by DataBridge
•  Subscribes to a message for “gene sequence” data
–  User B publishes a new gene sequence into DataBridge
•  A new message is created informing the new addition
•  User A’s algorithm detects, computes relevance to “signature”
•  Publishes a new relevance message for this “signature”
–  User C has a relevance algorithm that catches A’s message
•  Uses it to add B’s sequence to a new data community
–  Other signature detectors may also look at B’s gene sequence
•  Publish their own relevance, if applicable
–  User A can also look at ‘older’ gene sequences and find relevance
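The scenario above can be traced with a toy in-process publish/subscribe hub. This is a sketch under stated assumptions, not the DataBridge API: the `Broker` class, the topic names, the `GATTACA` signature, and the 0.5 cut-off are all hypothetical, and a real deployment would route messages through a message bus instead of direct function calls.

```python
from collections import defaultdict

class Broker:
    """Toy in-process publish/subscribe hub (stand-in for a real message bus)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, payload):
        # Copy the list so handlers may subscribe while we iterate.
        for handler in list(self._subscribers[topic]):
            handler(payload)

broker = Broker()
community = []  # User C's data community (hypothetical)

def signature_detector(message):
    """User A: score a new sequence against a hypothetical signature."""
    signature = "GATTACA"
    score = 1.0 if signature in message["sequence"] else 0.0
    broker.publish("relevance.signature", {"id": message["id"], "score": score})

def community_builder(message):
    """User C: add sufficiently relevant sequences to a community."""
    if message["score"] >= 0.5:
        community.append(message["id"])

# User A and User C register their algorithms.
broker.subscribe("data.gene_sequence", signature_detector)
broker.subscribe("relevance.signature", community_builder)

# User B publishes two new gene sequences.
broker.publish("data.gene_sequence", {"id": "seqB1", "sequence": "CCGATTACAGG"})
broker.publish("data.gene_sequence", {"id": "seqB2", "sequence": "TTTTCCCC"})
print(community)
```

None of the three users calls another directly. Each one only knows the topic it listens to and the topic it publishes on, which is what keeps the processes loosely linked.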
Modules: Current Status
•  RabbitMQ Messaging system
•  Ingest Engine
•  Relevance Engine
•  Network Engine
•  Ingestion GUI
•  Dataverse Network (DVN) access
•  Meta database
–  MongoDB
•  Network Database
–  Neo4j
•  Viz Display
Messages
Message                       Listener           Originator
Ingest Metadata               Ingest Engine      Any Data Provider (DVN)
Metadata Available            Relevance Engine   Metadatabase
Create Similarity Matrix      Relevance Engine   Any Ingest Engine
Similarity Matrix Available   Network Engine     Any Relevance Engine
Insert Similarity Matrix      Ingest Engine      User/App
Run SNA Algorithm             Network Engine     Network Engine or User/App
SNA Data Available            Network Engine     Network Database
Create Visualization Data     Network Database   Network Visualization
Show Visualization            Visualizer         User (WebApp)
Example Message Schema
Name Header             Insert.Metadata.Java.URI.MetadataDB

System headers          Example Value
type                    databridge
subtype                 ingestmetadata

User provided headers   Example Value
className               org.renci.databridgecontrib.ingest.mockingest
nameSpace               system_test
inputURI                /projects/databridge/metadata.xml
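As an illustration of how such headers could drive message delivery, the sketch below holds the example values in a dictionary and checks them against listener bindings. The header values come from the slide. The `name` key, the `matches` helper, the binding table, and the `runsna` subtype are assumptions made for this example. RabbitMQ offers a headers exchange that matches on header values in a similar way, though the slides do not say which exchange type DataBridge uses.

```python
# Header values copied from the example slide; the "name" key is an assumption.
message_headers = {
    "name": "Insert.Metadata.Java.URI.MetadataDB",
    # System headers
    "type": "databridge",
    "subtype": "ingestmetadata",
    # User provided headers
    "className": "org.renci.databridgecontrib.ingest.mockingest",
    "nameSpace": "system_test",
    "inputURI": "/projects/databridge/metadata.xml",
}

def matches(binding, headers):
    """True when every key/value pair in the binding appears in the headers."""
    return all(headers.get(key) == value for key, value in binding.items())

# Hypothetical listener bindings keyed by engine name.
bindings = {
    "ingest_engine": {"type": "databridge", "subtype": "ingestmetadata"},
    "network_engine": {"type": "databridge", "subtype": "runsna"},
}

listeners = [
    name for name, binding in bindings.items()
    if matches(binding, message_headers)
]
print(listeners)
```

Binding on the system headers alone leaves the user-provided headers free to carry per-message details such as the class to load and the input location.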
Screenshot: Finding similarities
Select Network Data
Filter Connectivity by similarity value
Screenshot: Weight of similarity
Similarity measure: 0.5
Screenshot: Highlights of similarities
Link to the data
Screenshot: Data Access
Screenshot: Simple Ingest GUI
Next Steps
•  Basic Framework implemented
–  Applied to a few thousand datasets
–  Work to do: advanced features
–  Documentation
–  Scaling tests
–  More types of data/metadata to be tested
•  Ready for new algorithms
•  Ready for more data
•  Ready for larger usage
•  Investigate multiple similarity measures
–  Usage, Methods as relevance
Players
•  Howard Lander
•  Justin Zhan
•  Merce Crosas
•  Gary King
•  Jon Crabtree
•  Tom Carsey
•  Sharlini Shankaran
•  Arcot Rajasekar
Conclusion
DataBridge
•  Motivation
•  Design
•  Implementation Status
•  Examples
•  Future
http://databridge.web.unc.edu
Arcot Rajasekar
rajasekar@unc.edu
The University of North Carolina at Chapel Hill