Lecture 5 of NoSQL Databases (PA195) David Novak & Vlastislav Dohnal Faculty of Informatics, Masaryk University, Brno Key-value Stores II Embedded, Distributed, and In-memory Stores Key-value Stores: Basics ● A simple hash table (map), primarily used when all accesses to the database are via primary key ○ key-value mapping ● In RDBMS world: A table with two columns: ○ ID column (primary key) ○ DATA column storing the value (unstructured BLOB) ● Basic operations: ○ Get the value for the key value:= get(key) ○ Put a value for a key put(key, value) ○ Delete a key-value delete(key) 2 Querying ● We can query by the key ● To query using some attribute of the value is not possible (in general) ○ We need to read the value to test any query condition ● What if we do not know the key? ○ Some systems support additional functionality ■ Using some kind of additional index (e.g., full text) ■ The data must be indexed first ■ Example later: Riak search 3 Techniques in Distributed Stores Consistent hashing Sharding (data partitioning) Virtual nodes Balancing of data Replication (consecutive nodes) Read/write scalability & reliability Read/write quora Consistency and r/w efficiency Vector stamps Avoid/detect update conflicts Gossip protocol Node join/leave/failure Multi-version concurrency control Transaction isolation Two-phase commit protocol (2PC) Distributed transactions 4 Representatives Project Voldemort Ranked list: http://db-engines.com/en/ranking/key-value+store 5 Agenda ● Embedded local storages ○ LevelDB ■ Local storage for many systems, Log-structured Merge Tree ● Distributed key-value Stores - representatives ○ Riak ■ Basics, Riak Links & Indexes & Riak Search, Internal features ○ Infinispan ■ Basic features, example, advanced features, indexing & searching ● Memory caches ○ Memcached ● Serialization: Protocol Buffers, Apache Thrift 6 Embedded Stores ● The database system is actually a library ○ One programming language, possibly wrapper in other lang ● We can use it directly in our application ○ It is embedded within the application ● Advantage: ○ Speed: the fastest connection between application and DB ● Disadvantages: ○ Database cannot be distributed ■ actually, embedded database nodes can form a distributed storage ○ Database cannot be shared by two applications 7 Embedded Stores: Representatives ● Embedded local storages ○ LevelDB ■ Local storage for many systems, Log-structured Merge Tree ■ C++ ○ MapDB ■ Java project, one-man show ■ memory-mapped file storage ○ RocksDB ■ Embeddable persistent key-value store ■ Facebook ■ C++, but also connector from Java 8 Representative: LevelDB LevelDB: Basics ● Embedded key-value store (string to string) ○ Using ideas from Google’s BigTable ○ Developers: Jeffrey Dean and Sanjay Ghemawat from Google ● Initial release date: 2011 ● License: New BSD Licence ● Language: C++ ● LevelDB is a backend for Google Chrome’s IndexDB http://github.com/google/leveldb http://db-engines.com/en/system/LevelDB 10 LevelDB: Fundamental Features ● Basic architecture is a LSM Tree (see below) ● Sorted by keys ● Arbitrary byte arrays ● Basic operations: Get(), Put(), Del(), Batch() ● Bi-directional iterators 11 Log-structured Merge Tree Log-structured Merge Tree (LSM Tree) ● data structure for indexed access to data files ○ can handle high write frequency ● writes applied to a sorted structure in memory ○ regularly synchronized to a sorted disk storage ● read ops merge data from memory & disk O'Neil, Patrick E.; Cheng, Edward; Gawlick, Dieter; O'Neil, Elizabeth (June 1996). "The logstructured merge-tree (LSM-tree)". Acta Informatica 33 (4): 351–385. 12 LevelDB: Basic Architecture ● Writes go straight into a log ● Log is flushed to sorted table files (SSTables) ● Reads merge the log and the SSTable files ● Cache speeds up common reads source: https://r.va.gg/presentations/nodejsdub/ 13 Basic Storage: SSTable Files Sorted String Table (SSTable) Files: ● Limited to ~2MB each ● Divided into 4K blocks ● Final block is an index ● Bloom filter used for lookups source: https://r.va.gg/presentations/nodejsdub/ 14 Log: Max size of 4MB then flushed into a set of Level 0 SSTables Level 0: Max of 4 SST files then the files compacted into Level 1 Level 1: Max total size of 10MB then the files compacted into L2 Level 2: Max total size of 100MB then the file compacted into L3 Level 3+: Max total size of 10x previous level size then the files compacted into next level. 0 ↠ 4 SST, 1 ↠ 10M, 2 ↠ 100M, 3 ↠ 1G, 4 ↠ 10G, 5 ↠ 100G, 6 ↠ 1T, 7 ↠ 10T, ... Levels in LevelDB source: https://r.va.gg/presentations/nodejsdub/ 15 LevelDB: Universal Backend ● LevelDB is a popular backend storage for many (distributed) database systems ○ Web browser IndexDB (in Chrome) ○ Riak, Infinispan ○ LevelUp/LevelDown for Node.js ○ etc. 16 Agenda ● Embedded local storages ○ LevelDB ■ Local storage for many systems, Log-structured Merge Tree ● Distributed key-value Stores - representatives ○ Riak ■ Basics, Riak Links & Indexes & Riak Search, Internal features ○ Infinispan ■ Basic features, example, advanced features, indexing & searching ● Memory caches ○ Memcached ● Serialization: Protocol Buffers, Apache Thrift 17 Distributed K-V Store: Riak 18 Riak: Basic Information ● Open source, distributed key-value database ○ Company Basho, first release: 2009 ○ OS: Linux, BSD, Mac OS X, Solaris ● Language: Erlang, C, C++, some parts in JavaScript ● Built-in support for MapReduce ● Provides a full-text search engine on the data ○ “Riak search” basic info: http://db-engines.com/en/system/Riak website: https://riak.com/products/riak-kv/ 19 Riak: Basic Mission ● Availability ○ Riak replicates and retrieves data intelligently so it is available for read/write operations, even in failure conditions ● Fault-Tolerance ○ You can lose access to many nodes due to network partition or hardware failure without losing data ● Operational Simplicity ○ Add new machines to your Riak cluster easily without incurring a larger operational burden ● Scalability ○ Riak automatically distributes data around the cluster and yields a near-linear performance increase as you add capacity source: https://riak.com/ 20 Riak: Basics Terminology in RDBMS vs. Riak namespace of keys ● Stores keys into buckets = a namespace for keys ○ Like tables in a RDBMS, directories in a file system, … ○ Bucket has its own properties ■ n_val – replication factor ■ allow_mult – allowing concurrent updates ■ ... 21 Riak: Interaction with the DB ● Default: HTTP Interface (Web services) ○ GET (retrieve value), PUT (update), DELETE (delete), … ○ example: http://localhost:8098/buckets/test/keys/mykey ● Native Erlang interface ● Connectors from many (not) standard languages ○ C, C#, C++ , Clojure, Dart, Go, Groovy, Haskell, Java, JavaScript, Lisp, Perl, PHP, Python, Ruby, Scala, Smalltalk 22 Riak: Additional Functionality ● Riak can have several types of local storage ○ typically referred to as backends ○ memory, LevelDB, etc. ● Riak has additional functions to work with values ○ Riak links ○ Indexes ○ Riak search 24 Riak: Links ● A way to create relationships between objects ○ Like foreign keys in RDBMS or associations in UML ● Attached to objects via HTTP header “Link” ● Add a book and link to its author: curl -X PUT http://localhost:8098/buckets/books/keys/NoSQL d '{"title": "Big Data a NoSQL databáze", "year": "2015"}' H 'Link: ; riaktag="wrote"' 25 Riak: Link Walking ● Locate a key and then continue by link(s) ○ target specification: /bucket,linktype,[0/1] ● Find the authors who wrote book NoSQL curl -i http://localhost:8098 /buckets/books/keys/NoSQL/authors,wrote,1 ○ Restrict to bucket authors ○ Restrict to tag wrote ○ 1 = include this step to the result 26 Riak: Indexes ● Secondary indexes on the values ○ Search key-value pairs based on the content ● Indexes kept locally on every virtual node ● Types of indexes: 1. integer index (search by value or interval of values) 2. binary index (search by any type of value) 3. fulltext index (Riak search) 27 Riak: Indexes (2) ● Indexes cannot be managed automatically ○ Because there is no schema on the values ● When inserting a value, one can use index ○ In HTTP API, use special HTTP headers curl -X PUT http://localhost:8098/buckets/authors/keys/David -H 'x-riak-index-surname_bin: Novak' -H 'x-riak-index-phone_int: 5062' -d '{"name": "David", surname "Novák", "phone ext": 5062 }' 28 Riak Search: Fulltext via Solr ● Riak provides a distributed, full-text search engine ○ Implemented using Solr (Lucene) ○ Inserted values are indexed automatically ○ ...and then search the data by “terms” ● Key features: ○ Different parsers for different mime types ■ JSON, XML, plain text, … ○ Exact match queries: “Bus” ○ Wildcards: “Bus*”, “Bus?” ○ Prefix matching, proximity searches, range queries... Documentation: https://docs.riak.com/riak/kv/2.2.3/developing/usage/search/index.html 29 Riak: Internal Features ● Let us have a look behind the scene of Riak ○ Consistent hashing ■ and virtual nodes ○ Peer-to-peer (masterless) data replication ○ Read/Write Quorums ○ Hinted handoffs ■ High availability ○ Vector clocks ■ Riak siblings ○ Gossip protocol ○ Query processing ○ Riak Enterprise 31 Consistent Hashing ● Data Partitioning ○ consistent hashing into [0, 2160] ○ data balancing achieved by virtual nodes (vnode) source: http://docs.basho.com/riak/latest/theory/concepts/ 32 P2P Replication ● Data Replication ○ to subsequent nodes ○ replication factor n_val ○ n_val can be set per bucket or per object ○ peer-to-peer “masterless” replication source: http://docs.basho.com/riak/latest/theory/concepts/ 33 Hinted Handoffs source: http://docs.basho.com/riak/latest/theory/concepts/ ● Goal: High availability ● Hinted handoff 1. In case of node failure 2. Neighboring nodes temporarily take over storage operations 3. When the failed node returns, the updates received by the neighboring nodes are handed off to it 34 Vector Clocks ● Any node is able to receive any request ○ We need to know which version of a value is current ● When a value stored, it is tagged with a vector clock curl http://localhost:8098/raw/plans/dinner -X PUT --data "Wednesday" curl -i http://localhost:8098/raw/plans/dinner HTTP/1.1 200 OK X-Riak-Vclock: a85hYGBgzGDKBVIsrLnh3BlMiYx5rAzLJpw7wpcFAA== Content-Type: text/plain Content-Length: 9 Wednesday 35 Vector Clocks (2) source: http://docs.basho.com/riak/latest/theory/concepts/Vector-Clocks/ ● For each update, Riak can determine: ○ Whether one object is a direct descendant of the other ○ Whether the objects are descendants of a common parent ○ Whether the objects are unrelated in recent heritage ● If the objects are unrelated then Riak can: ○ Auto-repair data ○ Provide the data to the user to decide curl -X PUT -H "X-Riak-ClientId: Ben" -H "X-Riak-Vclock: a85hYGBgzGDKBVIsrLnh3BlMiYx5rAzLJpw7wpcFAA==" http://localhost:8098/raw/plans/dinner --data "Tuesday" 36 Riak: Siblings ● Siblings of objects are created in case of: ○ Concurrent writes – two writes occur simultaneously with same vector clock value ○ Stale vector clock – stale v. clock value provided by client ○ Missing vector clock – write without a vector clock ● When retrieving an object we can: ○ Retrieve all siblings ○ Resolve the inconsistency source: https://docs.riak.com/riak/kv/2.2.3/learn/concepts/causal-context.1.html 37 Riak: Request Sharing ● Each node can be a coordinating vnode = node responsible for a request ○ Finds the vnode for the key according to hash ○ Finds vnodes where other replicas are stored – next N-1 nodes ○ Sends a request to all vnodes ○ Waits until enough requests returned the data ■ To fulfill the read/write quorum ○ Returns the result to the client 39 Riak Enterprise ● Commercial extension of Riak ● Adds support for: ○ Multi-datacenter replication ■ Using more clusters and replication between them ■ Real-time replication – incremental synchronization ■ Full-sync replication – entire data set is synchronized ○ SNMP monitoring ■ Simple Network Management Protocol ○ JMX monitoring ■ Java Management Extensions 40 Distributed K-V Store: Infinispan 41 Basics ● Developer: Red Hat, open source community ○ Originally developed as a memory-based cache for JBoss ● Initial release date: 2009, current version 12.1 ● License: Apache version 2 ● Language: Java ○ embedding to Java application OR ○ external service via various APIs (REST service, Memcached protocol, Hot Rod) OR ○ connectors: Groovy, Scala http://infinispan.org/ http://db-engines.com/en/system/Infinispan 43 Infinispan: Hello World public static void main(String args[]){ Cache store = new DefaultCacheManager().getCache(); store.put("key1", new MyClass("value1")); store.put("key2", "value2"); if (store.containsKey("key1")) { Object result = store.get("key2"); store.removeAsync("key2"); } store.replaceAsync("key2", "value3"); store.clear(); } 44 Infinispan: Features (1) ● Running in cluster ○ auto-sharding (distribution mode) ■ basically “consistent-hashing” (customizable) ■ fixed number of “segments” (like “vnodes” in Riak) ○ replication - master/slave (primary/backup owners) ■ synchronous (write through), asynchronous (write back/behind) ● Persistence ○ originally only memory-based, now fully configurable ■ file system store, JDBC store, LevelDB, JPA cache store,... ○ JBoss marshalling (serialization) of Java objects 45 Infinispan: Features (2) ● Cache features ○ eviction/expiration (remove objects automatically) ■ either when the cache is full (LRU) ■ or after some time (lifespan of an entry) ○ invalidation mode ■ a special type of cluster mode ■ when a value changes, other nodes are informed that their data is stale ○ L1 cache ■ each node keeps a local cache of key/values retrieved from other nodes ● MapReduce ○ full support of MapReduce processing ■ very efficient since version 7.0 46 Concurrent Operations ● Full transactional processing ○ Java Transaction API (JTA) ○ X/Open Extended Architecture (X/Open XA) ○ optimistic vs. pessimistic transactions ○ deadlock detection ○ Two-phase commit protocol (2PC) ● Distributed Execution Framework ○ executing a “Callable” on “nodes storing given set of keys” ○ compatible with standard Java Execution Framework 47 Concurrent Operations (2) ● Multi-version Concurrency Control (MVCC) ○ a technique to solve concurrent access to data ○ faster than strict use of r/w locks ○ popular in many (RDBMS) databases ● For transactions, user can choose isolation levels: ○ READ_UNCOMMITED ■ don’t use transactions at all ○ READ_COMMITED (default) ■ any transaction does see new value immediately after its commit ○ REPEATABLE_READS ■ using MVCC, the transaction does see the same values all the time 48 Infinispan: Querying ● Additional indexes ○ to provide search over stored values ○ using Hibernate Search technology ○ ...and Lucene ● Vice versa: ○ Infinispan can serve as a distributed storage for Lucene source: http://infinispan.org/docs/7.0.x/user_guide/user_guide.html 49 Example: Indexing // A class to be indexed is annotated with @Indexed // then you pick which fields and how to index them @Indexed public class Book { @Field String title; @Field String description; @Field @DateBridge(resolution=YEAR) Date publicationYear; @IndexedEmbedded Set<> authors = new HashSet(); } public class Author { @Field String name; @Field String surname; } source: https://infinispan.org/tutorials/ 50 Example: Searching SearchManager searchManager = Search.getSearchManager(store); // create a query via Lucene APIs or using builder QueryBuilder qBuilder = searchManager.buildQueryBuilderForClass(Book.class).get(); Query luceneQ = qBuilder.phrase() .onField("description").andField("title") .sentence("book on scalable query engines").createQuery(); CacheQuery res = searchManager.getQuery(luceneQ, Book.class); // and there are your results! List objectList = res.list(); source: https://infinispan.org/tutorials/ Task: Find books on "book on scalable query engines" 51 Agenda ● Embedded local storages ○ LevelDB ■ Local storage for many systems, Log-structured Merge Tree ● Distributed key-value Stores - representatives ○ Riak ■ Basics, Riak Links & Indexes & Riak Search, Internal features ○ Infinispan ■ Basic features, example, advanced features, indexing & searching ● Memory caches ○ Memcached ● Serialization: Protocol Buffers, Apache Thrift 52 Memory Caches The typical cache systems are: ● In-memory, distributed key-value stores ● Can be used to speed-up: 1. Web access to your system 2. Data access from different components of your system ● Typical features: ○ Limited size, FIFO or LRU algorithms ○ Limited validity of the key-value pair (e.g., 1 hour) 53 Memory Caches: Representatives ● Memcached ○ 2003, very popular ○ used by FB in early years (MySQL + Memcached) ● Ehcache ○ Java, compatible with javax.cache API ○ Directly storing Java objects into cache ● Hazelcast ○ In-memory data grid written in Java ○ Data evenly distributed among nodes in the cluster 54 Memcached: Basic Info ● In-memory distributed key-value store ● Initial release date: 2003 ○ by Brad Fitzpatrick for LiveJournal ● License: New BSD Licence ● Language: C ● Used by: ○ LiveJournal, Wikipedia, Flickr, WordPress.com, Craigslist https://memcached.org/ http://db-engines.com/en/system/Memcached 55 Memcached: Features ● Memcached ○ store small chunks of arbitrary data (strings, objects) ○ keys up to 250 bytes, values up to 1MB ● Typical usage ○ cache results of database calls, API calls, or page rendering ● API is available for most popular languages 56 Memcached: Architecture ● Client-server architecture ○ Client-side libraries to contact the servers ○ Each client knows all servers ○ Servers do not communicate with each other ● Static sharding ○ The client computes a hash(key) to determine the server ○ Scalable shared-nothing architecture across the servers 57 Agenda ● Embedded local storages ○ LevelDB ■ Local storage for many systems, Log-structured Merge Tree ● Distributed key-value Stores - representatives ○ Riak ■ Basics, Riak Links & Indexes & Riak Search, Internal features ○ Infinispan ■ Basic features, example, advanced features, indexing & searching ● Memory caches ○ Memcached ● Serialization: Protocol Buffers, Apache Thrift 58 Data Formats: Text Data ● Structured Text Data ○ JSON, BSON (Binary JSON) ■ JSON is currently number one data format used on the Web ○ XML: eXtensible Markup Language ○ RDF: Resource Description Framework 59 Data Formats: Binary Data ● Data objects to be stored often originate from memory structures (objects, class instances) ● Before storing, these objects must be serialized ○ Key-value stores can store a binary value ● Serialization (marshalling) can be done ○ By your own proprietary (de)serializator ○ Using “standard” language-specific way (Java serialization) ○ Using a cross-language standard: ProtoBuf, Apache Thrift, Apache Avro 60 Protocol Buffers ● Technique for serializing structured data ● Developed by Google since 2008 ○ BSD Licence ● Philosophy: 1. Define the structure of the data ■ Using an ProtoBuf interface description language 2. Automatically create source code in multiple programming languages for (de)serialization of such data ■ Compilers for Java, C++, Python, JavaScript, PHP, … http://en.wikipedia.org/wiki/Protocol_Buffers 61 Protocol Buffers: Example // file: addressbook.proto message Person { required string name = 1; required int32 id = 2; optional string email = 3; enum PhoneType { MOBILE = 0; HOME = 1; WORK = 2; } message PhoneNumber { required string number = 1; optional PhoneType type = 2 [default = HOME]; } repeated PhoneNumber phone = 4; } message AddressBook { repeated Person person = 1; } source: https://developers.google.com/protocol-buffers/ 62 Protocol Buffers: Example 2 - Java ● Compile this source by: protoc --java_out=jdir addressbook.proto protoc --cpp_out=cppdir addressbook.proto protoc --python_out=pdir addressbook.proto ● Result looks like this (Java): ○ you have getters; builder with setters; writeTo(outstream) https://github.com/jgilfelt/android-protobuf- example/blob/master/src/com/example/tutorial/AddressBookProtos.java source: https://developers.google.com/protocol-buffers/ 63 Apache Thrift ● Interface definition language + binary communication protocol ● Developed by Facebook -> open source (Apache) ● Similar philosophy as ProtoBuf ○ Write data schema once ○ Generate code in multiple languages ● Many languages: C#, C++, Erlang, Go, Haskell, Java, Node.js, OCaml, Perl, PHP, Python, Ruby, Smalltalk https://thrift.apache.org/ http://en.wikipedia.org/wiki/Apache_Thrift 64 Apache Thrift: Example enum PhoneType { HOME, WORK, MOBILE, OTHER } struct Phone { 1: i32 id, 2: string number, 3: PhoneType type } source: http://en.wikipedia.org/wiki/Apache_Thrift 65 Apache Thrift: Example source: https://thrift.apache.org/ 66 service PhoneBook extends shared.SharedService { i32 add(1:string num, 2:PhoneType type), void remove(1:i32 id), oneway void sms(1:string num, 2:string msg) } Apache Avro ● a row-oriented remote procedure call and ● data serialization framework ● serializes data in a compact binary format ● does not require running a code-generation program when a schema changes 67source: https://en.wikipedia.org/wiki/Apache_Avro Apache Avro: Schema Definition 68source: https://en.wikipedia.org/wiki/Apache_Avro { "namespace": "example.avro", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["null", "int"]}, {"name": "favorite_color", "type": ["null", "string"]} ] } ● primitive types ○ null, boolean, int, long, float, double, bytes, and string ● complex types ○ record, enum, array, map, union, and fixed Lecture Summary ● Key-value stores are popular for its simplicity and efficiency ● Most of the real key-value stores provide additional functionality to search on the values ● Besides distributed systems, there are local embedded stores and in-memory caches ● There are general frameworks to provide serialization of objects into binary data 69 References ● I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a NoSQL databáze. Praha: Grada Publishing, 2015. 288 p. ● Sadalage, P. J., & Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Addison-Wesley Professional, 192 p. ● doc. RNDr. Irena Holubova, Ph.D. MMF UK course NDBI040: Big Data Management and NoSQL Databases ● http://www.slideshare.net/quipo/nosql-databases-why- what-and-when ● https://riak.com/products/riak-kv/ ● https://infinispan.org/docs/stable/index.html 70