SEMANTIC TECHNOLOGY & BUSINESS CONFERENCE
SAN FRANCISCO, JUNE 5, 2012

| HOW TO INTEGRATE LINKED DATA INTO YOUR APPLICATION
LDIF Team:
Andreas Schultz, Freie Universität Berlin
Andrea Matteini, mes|semantics
Robert Isele, Freie Universität Berlin
Pablo N. Mendes, Freie Universität Berlin
Christian Becker, mes|semantics
Christian Bizer, Freie Universität Berlin
With contributions by: Hannes Mühleisen, Freie Universität Berlin; William Smith, Vulcan Inc.

| WHAT IS LINKED DATA?
• Raw data (RDF)
• Accessible on the web
• Data can link to other data sources
[Figure: Things in data sources A to E, connected by data links]
• Benefits: Ease of access and re-use; enables discovery
• One API for all data sources?

| LINKING OPEN DATA CLOUD
[Figure: Linking Open Data cloud diagram of interlinked data sets, grouped by domain: Media, Geographic, Publications, User-generated content, Government, Cross-domain, Life sciences. As of September 2011]

| TYPES OF LINKED DATA ...
VERY SOON?
• Open, Public Data (LOD Cloud)
• Linked Enterprise Data
• Commercial Linked Data
... AND WHAT YOU CAN DO WITH THEM
• Provide interfaces on top of them
• Augment your website
• Integrate them into your application logic
• Create specialized data marts

| AUGMENT YOUR WEBSITE: BBC
BBC online properties make intensive use of data from Wikipedia and MusicBrainz

| DATA MARTS: NEUROWIKI
• NeuroWiki creates views for gene, drug and disease data from four RDF data sources
• Provides navigation and composition tools for accessing and mining the data

| APPLICATION LOGIC: IBM WATSON
• IBM Watson makes use of Linked Data sources such as DBpedia

| 4 STEPS TO LINKED DATA INTEGRATION

| STEP #1: ACCESS LINKED DATA
• Linked Data is published via HTTP, SPARQL endpoints, RDF dumps
Architectures, access methods and decision factors:
• On-the-fly dereferencing
  Access methods: HTTP dereferencing
  Recency: high; Speed / Scalability: low; Reliability: low; Complexity: high
• Query federation
  Access methods: SPARQL
  Recency: high; Speed / Scalability: decreases exponentially as new sources are added; Reliability: low; Complexity: moderate with SPARQL 1.1 SERVICE clause
• Crawling and caching
  Access methods: HTTP dereferencing, dump import, SPARQL
  Recency: depends; Speed / Scalability: high; Reliability: high; Complexity: high
Adapted from: Linked Data: Evolving the Web into a Global Data Space (Heath/Bizer 2011)
• Live access allows quick prototyping and limited production use
• As data sets grow in size and more data sources are added, a crawling/caching architecture often becomes necessary

| STEP #1: ACCESS LINKED DATA
Implementations:
• On-the-fly dereferencing
  • LDspider, SQUIN, Semantic Web Client library
• Query federation
  • SPARQL 1.1 SERVICE clause
• Crawling and caching
  • Triplestore import script
  • Public caches (e.g. Sindice, OpenLink LOD endpoint)
  • LDIF

| STEP #2: NORMALIZE VOCABULARIES
Data sources that overlap in content use a wide range of vocabularies.
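To make the vocabulary problem concrete, here is a minimal Python sketch of what a consumer-side normalization step does: it rewrites triples from a source vocabulary into a single target vocabulary. The mapping tables, the `normalize` function and the sample triples are all hypothetical and only illustrate the idea; this is not how R2R or any other tool is implemented. The runtime conversion assumes the source value is given in seconds.

```python
# Triples are plain (subject, predicate, object) tuples with prefixed names
# kept as strings. All data and mappings below are made up for illustration.
RDF_TYPE = "rdf:type"

# Hypothetical class mapping: source class -> target class.
CLASS_MAP = {"dbpedia-owl:MusicalArtist": "mo:MusicArtist"}

# Hypothetical property mapping: source property -> (target property,
# value transformation). Assumption: the source runtime is in seconds.
PROPERTY_MAP = {
    "dbpedia-owl:runtime": ("movie:runtime", lambda seconds: seconds / 60),
}


def normalize(triples):
    """Yield the input triples rewritten into the target vocabulary."""
    for s, p, o in triples:
        if p == RDF_TYPE and o in CLASS_MAP:
            yield (s, p, CLASS_MAP[o])          # rename a class
        elif p in PROPERTY_MAP:
            target, transform = PROPERTY_MAP[p]
            yield (s, target, transform(o))     # rename property, convert value
        else:
            yield (s, p, o)                     # pass unmapped triples through


source = [
    ("ex:artist1", RDF_TYPE, "dbpedia-owl:MusicalArtist"),
    ("ex:film1", "dbpedia-owl:runtime", 5400),
]
normalized = list(normalize(source))
```

Keeping the mappings in data rather than in application code is the point of the exercise: adding a source means adding mapping entries, not writing another source-specific query.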
[Figure: tag cloud of the most widely used vocabularies in the LOD cloud (08/10/2011), among them foaf, geo, skos, dc, sioc, vcard, mo and bibo. Source: FU Berlin / DERI]
• Over 60 % of all LOD sources use proprietary vocabularies
• It’s up to the data consumer to normalize the vocabularies
• Enterprise: Need to translate between internal and external vocabularies

| STEP #2: NORMALIZE VOCABULARIES
Approaches to Schema Mapping:
• Hand-crafting queries against individual sources: no different than an API
  OPTIONAL { ?ow fb:location.location.containedby [ ot:preferredLabel ?city_fb_con ] } .
  OPTIONAL { ?ow dbp-prop:location ?loc. ?loc rdf:type umbel-sc:City ; ot:preferredLabel ?city_db_loc }
  OPTIONAL { ?ow dbp-ont:city [ ot:preferredLabel ?city_db_cit ] }
• Ontology Representation Languages: OWL, RDFS
• Rules: SWRL, RIF
• Query Languages
  • SPARQL CONSTRUCT clause
  • TopQuadrant SPARQLMotion
• Mosto
• R2R (part of LDIF)

| STEP #2: NORMALIZE VOCABULARIES
Using SPARQL:
• Rename a class
  CONSTRUCT { ?s a mo:MusicArtist }
  WHERE { ?s a dbpedia-owl:MusicalArtist }
• Value transformation (dbpedia-owl:runtime is given in seconds)
  CONSTRUCT { ?s movie:runtime ?runtimeInMinutes . }
  WHERE { ?s dbpedia-owl:runtime ?runtime . BIND(?runtime / 60 AS ?runtimeInMinutes) }
• Create URI from literal
  CONSTRUCT { ?s diseasome:omim ?omimuri . ?omimuri dc:identifier ?identifier . }
  WHERE { ?s dbpedia-owl:omim ?omim . BIND(IRI(concat("", ?omim)) AS ?omimuri) BIND(concat("omim:", ?omim) AS ?identifier) }
Slide credits: Andreas Schultz

| STEP #3: RESOLVE IDENTIFIERS
Data sources that overlap in content use different identifiers for the same real-world entity.
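As a rough illustration of what resolving identifiers involves, the Python sketch below links entries from two sources when their labels are similar and their coordinates are close. It uses only the standard library; the URIs, coordinates, thresholds and function names are hypothetical, and the comparison rule is a simplification chosen for this example rather than the behavior of any particular tool.

```python
import math
from difflib import SequenceMatcher


def label_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def distance_km(p, q):
    """Great-circle (haversine) distance between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def find_links(source_a, source_b, min_similarity=0.5, max_km=1.0):
    """Return owl:sameAs links for pairs that pass both comparisons."""
    links = []
    for uri_a, (label_a, pos_a) in source_a.items():
        for uri_b, (label_b, pos_b) in source_b.items():
            if (label_similarity(label_a, label_b) >= min_similarity
                    and distance_km(pos_a, pos_b) <= max_km):
                links.append((uri_a, "owl:sameAs", uri_b))
    return links


# Hypothetical records: URI -> (label, (latitude, longitude)).
source_a = {"a:union-square": ("Union Square", (37.788, -122.407))}
source_b = {
    "b:union-sq-ny": ("Union Sq., New York", (40.736, -73.990)),
    "b:union-sq-seattle": ("Union Sq., Seattle", (47.610, -122.332)),
    "b:union-sq-sf": ("Union Sq., San Francisco", (37.788, -122.408)),
}
links = find_links(source_a, source_b)
```

The labels alone cannot tell the three candidates apart, since all of them resemble "Union Square"; in this sketch it is the distance check that leaves only the San Francisco entry.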
[Figure: bar chart, number of linked data sets per source (08/10/2011). 1 linked data set: 98 sources; 2: 62; 3: 38; 4: 19; 5: 5; 6 - 10: 17; > 10: 27. Source: FU Berlin / DERI]
• Most LOD sources only provide owl:sameAs links to one other data source
• It’s up to the data consumer to generate additional links
• Enterprise: Need to link both internal and external resources

| STEP #3: RESOLVE IDENTIFIERS
Approaches to Identity Resolution:
• Improvised or manual merging
• Rule-based approaches:
  • Silk (part of LDIF)
  • LIMES
[Figure: "Union Square" at 37°47′N 122°24′W is compared with Union Sq., New York; Union Sq., Seattle; and Union Sq., San Francisco. Result: Union Sq. = Union Sq., San Francisco]

| STEP #4: FILTER DATA
Data sources that overlap in content provide data that is conflicting and of varying quality.
• Data sources have...
  • ... different knowledge levels, views or intents
  • ... wrong, biased, inconsistent or outdated information
Approaches:
• Import data into distinct Named Graphs; query them separately using the SPARQL GRAPH clause
• Sieve (part of LDIF)

| LDIF: LINKED DATA INTEGRATION FRAMEWORK
Integrates Linked Data from multiple sources into a clean, local target representation while keeping track of data provenance
1 Collect data: Managed download and update
2 Translate data into a single target vocabulary
3 Resolve identifier aliases into local target URIs
4 Cleanse data; resolving the conflicting values
5 Output
• Follows the Crawling and Caching Architecture Pattern
• Open source (Apache License, Version 2.0)
• Collaboration between Freie Universität Berlin and mes|semantics

| LDIF PIPELINE: 1 Collect data
Supported data sources:
• RDF dumps (all common formats)
• SPARQL Endpoints
• Crawling Linked Data via HTTP

| LDIF PIPELINE: 2 Translate data
Sources use a wide range of different RDF vocabularies
[Figure: dbpedia-owl:City, schema:Place and fb:location.citytown are translated by R2R into local:City]
• Simple mappings using OWL / RDFS statements (x rdfs:subClassOf y)
• Complex mappings with SPARQL expressivity
• Built-in transformation function library (XPath)

| LDIF PIPELINE: 3 Resolve identities
Sources use different identifiers for the same entity
[Figure: Silk compares "Union Square" at 37°47′N 122°24′W with Union Sq., New York; Union Sq., Seattle; and Union Sq., San Francisco. Result: Union Sq. = Union Sq., San Francisco]
• Automated link creation based on Link Specifications
• Supports various comparators and transformations (string similarity, basic arithmetic, time, geographical distance)

| LDIF PIPELINE: 4 Cleanse data
Sources provide different values for the same property
[Figure: two conflicting statements with star quality ratings, "San Francisco population is 0.7M" and "San Francisco population is 0.8M", pass through Sieve, which outputs "San Francisco population is 0.8M"]
1. Quality Assessment: assign quality scores to Named Graphs (by time, by source preference, thresholds)
2. Data Fusion: resolve conflicting property values (according to quality scores, frequency, averages)

| LDIF PIPELINE: 5 Output
Output options:
• N-Quads
• N-Triples
• SPARQL Update Stream
• Provenance tracking using Named Graphs

| LDIF ARCHITECTURE
[Figure: three-layer architecture diagram]
• Application Layer: Application Code; SPARQL or RDF API
• Data Access, Integration and Storage Layer: LDIF (Web Data Access Module, Data Translation Module, Identity Resolution Module, Data Quality and Fusion Module); Integrated Web Data
• Web of Data, accessed via HTTP
• Publication Layer: LD Wrapper (Database A); LD Wrapper (Database B); RDFa (CMS); RDF/XML
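The two cleansing stages described for the pipeline above, quality assessment followed by data fusion, can be sketched in a few lines of Python. This is an illustration under stated assumptions only: the graph names, dates, the linear recency score and its ten-year horizon are invented for the example and do not reflect Sieve's actual configuration or scoring functions.

```python
from collections import Counter
from datetime import date

# Hypothetical conflicting claims about one property (population in millions),
# each tagged with the named graph it came from and that graph's last update.
claims = [
    {"graph": "g:source-a", "value": 0.7, "last_updated": date(2005, 1, 1)},
    {"graph": "g:source-b", "value": 0.8, "last_updated": date(2011, 6, 1)},
    {"graph": "g:source-c", "value": 0.8, "last_updated": date(2010, 3, 1)},
]


def recency_score(last_updated, today, horizon_days=3650):
    """Score in [0, 1]: 1.0 for data updated today, 0.0 at or past the horizon."""
    age_days = (today - last_updated).days
    return max(0.0, 1.0 - age_days / horizon_days)


def assess(claims, today):
    """Quality assessment: assign one score per named graph."""
    return {c["graph"]: recency_score(c["last_updated"], today) for c in claims}


def fuse_keep_best(claims, scores):
    """Data fusion: keep the value from the highest-scoring graph."""
    return max(claims, key=lambda c: scores[c["graph"]])["value"]


def fuse_most_frequent(claims):
    """Alternative fusion: keep the value asserted by the most graphs."""
    return Counter(c["value"] for c in claims).most_common(1)[0][0]


scores = assess(claims, today=date(2012, 6, 5))
fused_by_quality = fuse_keep_best(claims, scores)
fused_by_frequency = fuse_most_frequent(claims)
```

Because each claim keeps its graph name, the fused value can still be traced back to the source that supplied it, which is what provenance tracking with Named Graphs makes possible.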
| VERSIONS
• In-memory
  • fast, but scalability limited by local RAM
• RDF Store (TDB)
  • stores intermediate results in a Jena TDB RDF store
  • can process more data than In-memory but doesn't scale
• Cluster (Hadoop)
  • scales by parallelizing work across multiple machines using Hadoop
  • can process a virtually unlimited amount of data
  • ready for Amazon Elastic MapReduce

| BENCHMARKS
KEGG GENES VS. UNIPROT (CLUSTER)
300M TRIPLES vs. 3.6B TRIPLES

| Q&A

| THANKS!
• Early adopters wanted!
• Website:
• Google Group:
• Supported in part by
  • Vulcan Inc. as part of its Project Halo
  • EU FP7 project LOD2 - Creating Knowledge out of Interlinked Data (Grant No. 257943)
Slide credits: Andrea Matteini, Robert Isele, Andreas Schultz
