Databases and data modelling Lecturer: Jiaheng Lu Spring 2016 www.helsinki.fi 10.10.2016 1 Data storage and history Before-1950s Data was stored as paper records Lot of time was wasted. e.g. when searching. Therefore inefficient. www.helsinki.fi Magnetic tapes and hard disk • 1950s and early 1960s: Data processing using magnetic tapes for storage • Late 1960s and 1970s: Hard disks allow direct access to data • • Data stored in files Matemaattis-luonnontieteellinen tiedekunta / Iso tiedonhallinta/ Jiaheng Lu www.helsinki.fi 10.10.2016 3 Drawbacks of file system • Each program has its own data format • Programs are written in different languages, and so cannot easily access each other’s files. • Any new requirement needs a new program Matemaattis-luonnontieteellinen tiedekunta / Iso tiedonhallinta/ Jiaheng Lu www.helsinki.fi 10.10.2016 4 Database Approach • • • • • 1960’s Network databases 1970’s Relational databases 1990’s Object-oriented and object-relational 1995+ XML, Mobile, GeoDB, Embedded DB 2005+ NoSQL DB, NewSQL DB Matemaattis-luonnontieteellinen tiedekunta / Iso tiedonhallinta/ Jiaheng Lu www.helsinki.fi 10.10.2016 5 History of databases: Turing awards 1973 Charles W. Bachman 1998 Jim Gray 1981 Edgar F. Codd 2014 www.helsinki.fi Michael Stonebraker 6 History of databases: Turing awards Object-relational model, column stores,…Modern databases Distributed databases and transaction Network databases Relational databases 2014 Michael Stonebraker 1998 Jim Gray 1981 Edgar F. Codd 1973 Charles W. Bachman www.helsinki.fi 7 Data model describes the characteristics • Structure • Operations • Constraints Matemaattis-luonnontieteellinen tiedekunta / Iso tiedonhallinta/ Jiaheng Lu www.helsinki.fi 10.10.2016 8 Data model describes the characteristics • Structure (John, 32, Male) (Mary, 27, Female) (Anna, 57, Female) • Operations • Constraints Matemaattis-luonnontieteellinen tiedekunta / Iso tiedonhallinta/ Jiaheng Lu www.helsinki.fi 10.10.2016 9 Data model describes the characteristics • Structure • Operations E.g. Join and Union • Constraints Matemaattis-luonnontieteellinen tiedekunta / Iso tiedonhallinta/ Jiaheng Lu www.helsinki.fi 10.10.2016 10 Data model describes the characteristics • Structure • Operations • Constraints Matemaattis-luonnontieteellinen tiedekunta / Iso tiedonhallinta/ Jiaheng Lu Age is never negative and always smaller than 150 www.helsinki.fi 10.10.2016 11 Two structures of data files Structure 1 (John, 32, Male) (Mary, 27, Female) (Anna, 57, Female) Structure 2 (John, 32, Male, Engineer) (Mary, 27, Female, Doctor) (Anna, 57, Female, Teacher)c Matemaattis-luonnontieteellinen tiedekunta / Iso tiedonhallinta/ Jiaheng Lu www.helsinki.fi 10.10.2016 12 Examples of operations • Subsetting • Given a condition and a set of data, find a subset of data which satisfy the condition • Substructure extracting • Extract from each data item a part of structure as specified by a condition • Union and Join Matemaattis-luonnontieteellinen tiedekunta / Iso tiedonhallinta/ Jiaheng Lu www.helsinki.fi 10.10.2016 13 Types of constraints • Value constraints • Age is never negative • Uniqueness constraints • Any course can have only one ID • Cardinality constraints • Each person can have at most three blood pressure records in the system Matemaattis-luonnontieteellinen tiedekunta / Iso tiedonhallinta/ Jiaheng Lu www.helsinki.fi 10.10.2016 14 Types of constraints (cont’d) • Type constraint • Age is an integer • Domain constraint • Month is between 1 to 12 • Structure constraints • Put constraints on structure rather than data values Matemaattis-luonnontieteellinen tiedekunta / Iso tiedonhallinta/ Jiaheng Lu www.helsinki.fi 10.10.2016 15 Relational models • A collection of tables • No duplicates tuples • Dissimilar tuples disallowed (John, 32, Male, Engineer) (Mary, 27, Female, Doctor) (Anna, 57, Female, Teacher) (23, 87, Computer science, Teacher) Matemaattis-luonnontieteellinen tiedekunta / Iso tiedonhallinta/ Jiaheng Lu www.helsinki.fi 10.10.2016 16 Relational models • A collection of tables • No duplicates tuples • Dissimilar tuples disallowed (John, 32, Male, Engineer) (Mary, 27, Female, Doctor) (Anna, 57, Female, Teacher) (23, 87, Computer science, Teacher) Matemaattis-luonnontieteellinen tiedekunta / Iso tiedonhallinta/ Jiaheng Lu www.helsinki.fi 10.10.2016 17 Relational models (cont’d) • A collection of tables • No duplicates tuples • Dissimilar tuples disallowed • Foreign keys and primary keys Matemaattis-luonnontieteellinen tiedekunta / Iso tiedonhallinta/ Jiaheng Lu www.helsinki.fi 10.10.2016 18 Join • Join operation for two tables (John, 32, Male, Engineer) (Mary, 27, Female, Doctor) (Anna, 57, Female, Teacher) (John, Married, ) (Mary, Single) (Anna, Married,) (John, 32, Male, Engineer Married, ) (Mary, 27, Female, Doctor Single) (Anna, 57, Female, Teacher Married,) Matemaattis-luonnontieteellinen tiedekunta / Iso tiedonhallinta/ Jiaheng Lu www.helsinki.fi 10.10.2016 19 Semi-structured model: XML • eXtensible Markup Language • Markup language for documents containing structured information www.helsinki.fi XML…. • Based on Standard Generalized Markup Language (SGML) • Version 1.0 introduced by World Wide Web Consortium (W3C) in 1998 • Bridge for data exchange on the Web www.helsinki.fi A Simple XML Document Freely definable tags <article> <author>Gerhard Weikum</author> <title>The Web in Ten Years</title> <text> <abstract>In order to evolve...</abstract> <section number=“1” title=“Introduction”> The <index>Web</index> provides the universal... </section> </text> </article> www.helsinki.fi 22 April 29th, 2003 A SimpleStart XML Document Tag <article> <author>Gerhard Weikum</author> Content of <title>The Web in Ten Years</title> the Element <text> (Subelements <abstract>In order to and/or Text) evolve...</abstract> End Tag <section number=“1” title=“Introduction”> The <index>Web</index> provides the universal... </section> </text> </article> www.helsinki.fi 23 April 29th, 2003 A Simple XML Document <article> <author>Gerhard Weikum</author> <title>The Web in Ten Years</title> <text> <abstract>In order to evolve...</abstract> <section number=“1” title=“Introduction”> The <index>Web</index> provides the universal... </section> Attributes with name and value </text> </article> www.helsinki.fi 24 April 29th, 2003 Elements in XML Documents • (Freely definable) tags: article, title, author • with start tag: <article> etc. • and end tag: </article> etc. • Elements: <article> ... </article> • Elements have a name (article) and a content (...) • Elements may be nested. • Elements may be empty: <this_is_empty/> www.helsinki.fi 25 Elements vs. Attributes What is the difference between elements and attributes? • Only one attribute with a given name per element (but an arbitrary number of subelements) • Attributes have no structure, simply strings (while elements can have subelements) As a rule of thumb: • Content into elements • Metadata into attributes Example: <person born=“1912-06-23“ died=“1954-06-07“> Alan Turing</person> proved that… www.helsinki.fi 26 XML Documents as Ordered Trees article author title text number=“1“ abstract Gerhard Weikum section title=“…“ In order … The Web in 10 years The index Web www.helsinki.fi 27 provides … Document Type Definitions (DTD) • An XML document may have an optional DTD. • DTD serves as grammar for the underlying XML document, and it is part of XML language. • DTD has the form: <!DOCTYPE name [markupdeclaration]> www.helsinki.fi DTD (cont’d) • Consider an XML document: <db><person><name>Alan</name> <age>42</age> <email>agb@usa.net </email> </person> <person>………</person> ………. </db> www.helsinki.fi DTD (cont’d) • DTD for it might be: <!DOCTYPE db [ <!ELEMENT db (person*)> <!ELEMENT person (name, age, email)> <!ELEMENT name (#PCDATA)> <!ELEMENT age (#PCDATA)> <!ELEMENT email (#PCDATA)> <! AttributeListemail (#PCDATA)> ]> www.helsinki.fi DTD (cont’d) Occurrence Indicator: Indicator Occurrence (no indicator) Required ? Optional * Optional, repeatable Required, repeatable + One and only one None or one None, one, or more One or more www.helsinki.fi Important attribute types • There are ten attribute types • These are the most important ones: • CDATA The value is character data • (man|woman|child) The value is one from this list • ID The value is a unique identifier ‒ ID values must be legal XML names and must be unique within the document www.helsinki.fi More attribute types • IDREF The ID of another element • IDREFS A list of other IDs • ENTITY An entity • ENTITIES A list of entities • NOTATION A notation www.helsinki.fi Requirements • Recall that an attribute has the form <!ATTLIST element-name name type requirement> • The requirement is one of: • A default value, enclosed in quotes ‒ Example: <!ATTLIST degree CDATA "PhD"> • #REQUIRED ‒ The attribute must be present • #IMPLIED ‒ The attribute is optional • #FIXED "value" ‒ The attribute always has the given value ‒ If specified in the XML, the same value must be used www.helsinki.fi Another example: XML <?xml version="1.0"?> <!DOCTYPE weatherReport SYSTEM "http://www.mysite.com/mydoc.dtd"> <weatherReport> <date>05/29/2002</date> <location> <city>Philadelphia</city>, <state>PA</state> <country>USA</country> </location> <temperature-range> <high scale="F">84</high> <low scale="F">51</low> </temperature-range> </weatherReport> www.helsinki.fi The DTD for this example <!ELEMENT weatherReport (date, location, temperature-range)> <!ELEMENT date (#PCDATA)> <!ELEMENT location (city, state, country)> <!ELEMENT city (#PCDATA)> <!ELEMENT state (#PCDATA)> <!ELEMENT country (#PCDATA)> <!ELEMENT temperature-range ((low, high)|(high, low))> <!ELEMENT low (#PCDATA)> <!ELEMENT high (#PCDATA)> <!ATTLIST low scale (C|F) #REQUIRED> <!ATTLIST high scale (C|F) #REQUIRED> www.helsinki.fi Operations on XML documents • GetParent • GetChildren • GetSibling • Root-node path • Query needs the tree traversal Matemaattis-luonnontieteellinen tiedekunta / Iso tiedonhallinta/ Jiaheng Lu www.helsinki.fi 10.10.2016 37 Querying XML with XPath and XQuery XPath and XQuery are query languages for XML data, both standardized by the W3C and supported by various database products. A query result is a set of qualifying nodes, paths, subtrees, or subgraphs from the underyling data graph, or a set of XML documents constructed from this raw result. 38 www.helsinki.fi XPath • XPath is a simple language to identify parts of the XML document (for further processing) • XPath operates on the tree representation of the document • Result of an XPath expression is a set of elements or attributes • Discuss abbreviated version of XPath www.helsinki.fi 39 Elements of XPath • An XPath expression usually is a location path that consists of location steps, separated by /: /article/text/abstract: selects all abstract elements • A leading / always means the root element • Each location step is evaluated in the context of a node in the tree, the so-called context node • Possible location steps: • • • • child element x: select all child elements with name x Attribute @x: select all attributes with name x Wildcards * (any child), @* (any attribute) Multiple matches, separated by |: x|y|z www.helsinki.fi 40 Combining Location Steps • Standard: / (context node is the result of the preceding location step) article/text/abstract (all the abstract nodes of articles) • Select any descendant, not only children: // article//index (any index element in articles) • Select the parent element: .. • Select the content node: . The latter two are important when using predicates. www.helsinki.fi 41 Predicates in Location Steps • Added with [] to the location step • Used to restricts elements that qualify as result of a location step to those that fulfil the predicate: • a[b] elements a that have a subelement b • a[@d] elements a that have an attribute d • Plus conditions on content/value: ‒ a[b=„c“] ‒ A[@d>7] ‒ <, <=, >=, !=, … www.helsinki.fi 42 XPath by Example /literature/book/author retrieves all book authors: /literature/(book|article)/author /literature/*/author /literature//author /literature//@year authors of books or articles authors of books, articles, essays, etc. authors that are descendants of literature value of the year attribute of descendants of literature /literature//author[firstname] authors that have a subelement firstname /literature/book[author//country = “Germany“] books with German author www.helsinki.fi 43 Semi-structure model: JSON • JSON is a data interchange format • Interactive Web 2.0 applications, no more use page replacement. Data transfer without refreshing a page. • The most important aspects of data transfer are simplicity, extensibility, interoperability, openness and human readability • Key idea in AJAX – Asynchronous Java Script and XML. www.helsinki.fi How does it work? • JSON is a subset of Java Script. JSON can be parsed by a Java Script parser. • It can represent either complex or simple data as it has data types • They are Strings, Number, Boolean, Objects and Arrays • E.g. of Object: • { "name": "Jack Nimble", "format": { "type": "rect", "width": 120, "interlace": false}} www.helsinki.fi • An array can be shown as • ["Sunday", "Monday", "Tuesday", "Wednesday“] • All data types are intuitive and similar to other programming languages • Also compatible with other languages like C, C++, C#, ColdFusion, Python and many more. www.helsinki.fi Graph data model Nodes table: 1. John 2. Mary 3. Anna John Friend Anna Mary Sister Matemaattis-luonnontieteellinen tiedekunta / Henkilön nimi / Esityksen nimi Edge table: John, Mary,Friend Mary,Anna,Sister www.helsinki.fi 10.10.2016 47 Graphs from the Real World Königsberg's Bridges Ref: http://en.wikipedia.org/wiki/Seven_Bridges_of_K%C3%B6nigsberg www.helsinki.fi Graphs from the Real Word Webpage Hyperlink Graph Directed Communities Network of Word Associations Overlapping Communities www.helsinki.fi Operations on graphs • GetNeighbour • ShortestPath(A,B) • Community detection Matemaattis-luonnontieteellinen tiedekunta / Iso tiedonhallinta/ Jiaheng Lu www.helsinki.fi 10.10.2016 50 Applications of Community Detection • Website mirror server assignment • Recommendation system • Social network role detection • Functional module in biological networks • Graph coarsening and summarization • Network hierarchy inference www.helsinki.fi Defining Communities • Intuition: There are more edges inside a community than edges connected with the rest of the graph www.helsinki.fi General Challenges • Many clustering problems are NP-hard. Even polynomial time approaches may be too expensive • Call for scalable solutions • Concepts of “cluster”, “community” are not quantitatively well defined www.helsinki.fi
© Copyright 2024