PB138 — XML databases, NoSQL databases NoSQL databases • non relational databases, flexible schema • often used for big data applications, clusters • different storage structure than SQL databases • give up constraints/transactions to improve performance • low-level interface NoSQL types • key-value ◦ Redis, Memcached, Amazon SimpleDB… • document (JSON, XML…) ◦ CouchDB, Elasticsearch, MongoDB… • graph / RDF triple ◦ Virtuoso, Neo4j… • object ◦ Caché, GemStone… RDF databases / triple store • standard data model (RDF) • standardized interchange format (N-Triples, N-Quads, XML,…) • query language (SPARQL), Linked Data • native ◦ Apache Jena, Sesame/RDF4J… • RDF layer to relational database ◦ Virtuoso, IBM DB2… SPARQL • SPARQL Protocol and RDF Query Language • W3C Recommendation SPARQL 1.1, March 2013 1 • SELECT - values as table • CONSTRUCT - extract RDF • ASK - true/false • DESCRIBE - extract RDF graph • inferencing SPARQL example Ontology ex1:FullProfessor rdf:subClassOf ex1:Professor. ex1:AssistantProfessor rdf:subClassOf ex1:Professor. ex1:Professor owl:equivalentClass ex2:Teacher Data ex1:Bob rdf:type ex1:FullProfessor . ex1:Alice rdf:type ex1:AssistantProfessor . ex2:Mary rdf:type ex2:Teacher SPARQL example Data ex1:Bob rdf:type ex1:FullProfessor . ex1:Alice rdf:type ex1:AssistantProfessor . ex2:Mary rdf:type ex2:Teacher SPARQL query SELECT ?x  WHERE {  ?x rdf:type ex1:Professor  } • noone is Professor, but inferencing will find Bob, Alice, Mary XML databases, when to use • working with documents or metadata in XML format • data format/schema changes over time • complex and variable schema 2 • structure queries XML database concepts • basic element = document • documents gathered in collections ("tables") • query on document structure • output is document, document fragment, or constructed XML XML database types • XML-enabled databases ◦ mapping XML data to own data model (relational, object…) ▪ character large object ▪ fragmented to series of tables/objects ▪ stored in XML Type ◦ ISO SQL/XML - element construction, data mapping, enhanced SQL with XQuery ◦ Oracle, IBM DB2, MS SQL, PostgreSQL XML database types • native XML databases ◦ using XML data model directly • open-source ◦ eXist, Sedna, BaseX, MonetDB, Oracle/Berkeley DB XML • commercial ◦ MarkLogic, Virtuoso, Qizx Interface XQuery { for $book in collection("books")/book where $book/year="1990" return $book/title } 3 Interface XQuery Update update delete collection("books")/book/isbn update insert XQuery Guide  into collection("books")/book[isbn="42"] Interface SQL/XML (result-table) SELECT   id, vol,   xmlquery('$j/name', passing journal as "j") as name FROM   journals WHERE   xmlexists('$j[licence="CreativeCommons"]',   passing journal as "j") Interface SQL/XML (result-XML) SELECT XMLELEMENT (NAME "saleProducts",   XMLNAMESPACES (DEFAULT 'http://posample.org'),   XMLAGG (XMLELEMENT (NAME "prod",   XMLATTRIBUTES (p.Pid AS "id"),   XMLFOREST (p.name AS "name",   i.quantity AS "numInStock")))) FROM PRODUCT p, INVENTORY i WHERE p.Pid = i.Pid • XSLT - output transformation • XML Schema - input validation Interface • XQJ - XQuery API for Java ◦ unified query layer between application and XML Datasource 4 ◦ prepared statements ◦ binding variables Interface XQDataSource xqs = new ExistXQDataSource(); XQDataSource xqs = new SednaXQDataSource(); xqs.setProperty("serverName", "localhost"); XQConnection conn = xqs.getConnection(); XQExpression xqe = conn.createExpression(); String xqueryString = "for $x in doc('books.xml')//book   return $x/title/text()"; XQResultSequence rs = xqe.executeQuery(xqueryString); while(rs.next())   System.out.println(rs.getItemAsString(null)); conn.close(); Interface • XML:DB • similar concept to JDBC, abstract interface to XML database ◦ Driver - access to given database ◦ Collection - document collection in database ◦ Services - support database features, e.g. XPathQueryService, XUpdateQueryService ◦ Resource - data stored in database ◦ ResourceSet - data as result of query Storage • intact document storage ◦ unique identifier for document ◦ preferably parse and index on storage ◦ on query: fast, if application need access to whole document and index can select the right document ◦ slow, if additional parsing needed Storage • parsing documents ◦ document parsed on save and stored in own data model (eg. DOM) 5 ◦ addressable numbered nodes - some operations faster based on numbers ◦ no need for parsing on query, more efficient ◦ retrieved document not 100% same ◦ fine granularity for addressing ◦ partial modifications Indexing • index scope - collection, database, document • index target - document, node • value index ◦ store all combinations of element/attribute values • substring index ◦ for contains() etc., store n-grams Indexing • structural index ◦ track existing paths, enriched tree, trie, DataGuide etc. Benchmark • compare database performance • mostly XQuery speed, less often Update • data generator (up to GBs) and a set of XQueries • XMark, XBench, XMach-1 • TPoX - complex database testing, XQuery and SQL/XML, indexing, XML Schema, XQuery Update 6