Distributed Key-value Stores Lecture 4 of NoSQL Databases (PA195) David Novak & Vlastislav Dohnal Faculty of Informatics, Masaryk University, Brno Agenda ● Fundamentals of Key-value Stores ○ Basic Example: Riak ● Key Techniques of Many Key-value Stores ○ Data Sharding: Consistent hashing + virtual nodes ○ Replica management & consistency (version stamps) ○ Gossip protocols (distributed management of nodes) ○ Transactions: Two-phase commit protocol (2PC); MVCC ● Comparison of K-V Stores and Applicability ○ Features to consider - basic, advanced ○ Modes of communication with the database ○ When (not) to use Key-value Stores 2 Key-value Stores: Basics ● A simple hash table (map), primarily used when all accesses to the database are via primary key ○ key-value mapping ● In RDBMS world: A table with two columns: ○ ID column (primary key) ○ DATA column storing the value (unstructured BLOB) ● Basic operations: ○ Put a value for a key put(key, value) ○ Get the value for the key value:= get(key) ○ Delete a key-value delete(key) 3 Querying ● We can query by the key ● To query using some attribute of the value is not possible (in general) ○ We need to read the value to test any query condition ● What if we do not know the key? ○ Some systems support additional functionality ■ Using some kind of additional index (e.g., full text) ■ The data must be indexed first ■ Example later: Riak search 4 Representatives Project Voldemort Ranked list: http://db-engines.com/en/ranking/key-value+store 5 Riak: Basic Information ● Developer: Basho, open source community ○ there is a company behind ● Initial release date: 2009 ○ it is not a new (shaky) technology ● License: Apache 2 + commercial enterprise ○ for free, but with option to have a support ● Language: Erlang, C, C++, some parts in JavaScript ○ Efficient; not possible to embed to e.g. Java application ● Server OS: Linux, BSD, Mac OS X, Solaris basic info: http://db-engines.com/en/system/Riak website: https://riak.com/products/riak-kv/ 6 Riak: HTTP REST API ● Riak HTTP API ○ HTTP Restful service (aka HTTP REST) ○ a simple way to define Web services ● Listening on a port and providing services http://localhost:8098/ ● Such interface can be directly called from ■ application written in any language ■ client side of the application (e.g., AJAX request) ■ command line (simple scripts), … 7 Riak: Basic Operations ● We will use curl -X method URL -d data ○ command line tool to communicate with server (HTTP(S),...) curl -X PUT http://localhost:8098/buckets/authors/keys/David -d '{"name": "David Novák", "affiliation": "MU"}' curl -X GET http://localhost:8098/buckets/authors/keys/David {"name": "David Novák", "affiliation": "MU"} curl -X DELETE http://localhost:8098/buckets/authors/keys/David 8 Management of the Keys ● How to design the key? ○ Provided by the user (natural unique key): ■ shopping cart data (user ID) ■ web session data (with the session ID as the key) ■ user profiles (user ID), … ○ Generated by some algorithm ○ Derived from time-stamps (or other data) ● Expiration of keys ○ After a certain time interval ■ e.g., for caches, session/shopping cart objects,... 9 ● e.g., Riak defines a location for each value location = Keys: Buckets (Namespaces) ● keys can be grouped into buckets (namespaces) ○ division of the key space ○ logical differentiation of records by types ○ but physically, all keys are hashed into the same space put( , value ) value := get( ) delete( ) 10 Namespaces: Example userID userProfile sessionData shoppingCart ● item 1 ● item 2 Namespace User Key: Value: Namespace UserProfiles userID userProfile Key: Value: Namespace Sessions userID sessionData Key: Value: userID shoppingCart ● item 1 ● item 2 Namespace ShoppingCart Key: Value: Single namespace: ● works well, if the application often wants to access all data ● this is an example of the aggregates-based data modelling 11 Key-value Stores: The Beginning ● 2007: Amazon Dynamo paper ○ DeCandia, G. et al. (2007). Dynamo: Amazon’s Highly Available Key-value Store. ACM SIGOPS Operating Systems Review, 41(6), pp 205–220. ■ http://dl.acm.org/citation.cfm?id=1294281 ● Amazon Dynamo: first fully-fledged distributed key-value store ● Now: DynamoDB - available as paid service ○ http://aws.amazon.com/dynamodb/ 12 Dynamo and Other Systems ● Amazon Dynamo paper ○ Defined the fundamental challenges and their solutions ○ Laid the foundations for other systems ● Other Key-value Stores ○ Each has a slightly different purpose ■ The set of challenges differs ○ Each has a specific set of solutions of these challenges ○ The additional functionality may differ ■ Besides basic put/get/delete operations 13 Selected Challenges & Solutions Challenge Selected Techniques Data partitioning (sharding) Consistent hashing Read scalability & reliability Data replication Replica management Version stamps, vector clocks Detection of a node join/leave/failure Gossip protocol (no centralized registry of nodes’ membership and liveness) Concurrency, transactions Two-phase commit protocol, MVCC 14 Data Sharding: Consistent hashing + virtual nodes Key Techniques of Key-value Stores node(key) = hash(key) mod (#_of_nodes - 1) node(key) = hash(key) mod (#_of_nodes) node 4 Sharding: Modulo Hashing ● We want to use hash(key) for partitioning of key-value pairs to nodes (auto sharding) node 4node 1 node 2 node 3 ? ● Recalculate the hashes of all objects, if #_of_nodes changes ○ and migrate practically all data objects to different nodes ● Standard modulo-based hashing: 16 node B Consistent Hashing: Principles Use the same hash function for data and nodes hash: Keys->[0,2n]Hash value space (ring) node A node B node E node D node C 2n = 0 node “B” is responsible for interval [A, B] node “C” is responsible for interval [A, C] (only data in [A,B] range need relocation) For each hash value, the next clockwise node is “responsible” 17 Sharding by Hashing ● Consistent hashing ○ is used in massively distributed systems (like Riak) ● Modulo-based hashing ○ is also used, e.g., in Solr and Lucene ● Modulo hashing is good for keeping data balanced ○ consistent hashing cannot guarantee balanced data ■ especially for low number of nodes ○ it must use different techniques to achieve balancing 18 Consistent Hash: Data Balancing Virtual nodes: ● Q equal-sized partitions (virtual nodes) ● S physical nodes ● Q/S partitions per node ● assumed: Q >> S ● result: balanced distribution of data to physical nodes source: https://docs.riak.com/riak/kv/2.2.3/learn/concepts/vnodes.1.html 19 Replica management & consistency Key Techniques of Key-value Stores Consistent Hash: Data Replication ● Each object stored at N consecutive nodes ● Possible both master-slave or peer-to-peer replication ● master-slave is OK for read-intensive applications source: https://docs.riak.com/riak/kv/2.2.3/learn/concepts/index.html 21 P2P Replication: Consistency ● Recall the concept of quorum ○ N = replication factor (typical default: N = 3) ○ W = data must be written at least at W nodes ○ R = data must be read at least from R nodes W > N/2 R + W > N ● Example: replication factor N = 5, quora W = 3 ○ Write is reported as successful only when reported as a successful on >= 3 nodes ○ Tolerate N – W = 2 nodes being down for write operations 22 Quora per Operation ● The R/W values can be often set per operation ○ Riak: all / one / quorum / an integer value ○ it is a way to tune efficiency/availability vs. consistency ● example: N = 3, quora W = 2, R = 2 ○ value for key1 is stored on nodeA, nodeB, nodeC ○ at least two of them always have the newest value ■ and operation get(key1) will always get the newest value ● we can set R = 1 for operation value:= get(key1) ○ meaning: get the value the from any replica, e.g., nodeB ○ even though nodeA and nodeC may have a newer value 23 A Problem to Solve Let’s assume the peer-to-peer replication... ○ with write/read quora ...we need to have a mechanisms to: 1. recognize which value for a single record is newer 2. find out that two write operations are concurrent and causing a write-write conflict 24 SYNC (key, value2)(key, value3)(key, value3) (key, value5)(key, value1)(key, value2)(key, value2)(key, value3)(key, value3) (key, value2)(key, value3)(key, value3)(key, value4)(key, value1)(key, value2) CONFLICT? (key, value1) P2P Replication Conflict Example three nodes, initial state: (key, value1) blue red green ● single update, then sync ● next update on green ○ how to find out which value is newer? ● two simultaneous updates ○ how to find out that it is a conflict? ● how to find out which value is newer? 25 Family of techniques: avoid/detect update conflicts Version Stamps ● Version stamp in general: ○ A field created for each record ○ The stamp changes every time the data record changes ● Basic usage (also in centralized system): ○ A client reads the stamp together with the record ○ When later updating the record, the stamp is sent back together with the new value and checked ○ If the stamp differs from the actual stamp => conflict 26 Constructing Version Stamps There are several ways to construct the stamps: 1. counter - incremented after each record update ○ pros: it is clear, which version is newer ○ cons: duplications must be avoided (single master?) 2. GUID - a large unique random number ○ pros: anybody can generate them (client) ○ cons: cannot be checked for recentness 3. Hash from the data ○ pros: anybody can generate it, is deterministic ○ cons: cannot be checked for recentness 27 Constructing Version Stamps (2) 4. Timestamps ○ pros: recentness like counters, a single master not needed ○ cons: clock synchronization, sufficient granularity needed Combination is worth: ● counter + hash: ○ counter = recentness comparison ○ hash = if two updates appear concurrently on two servers (with the same counter), the hash identifies the conflict 28 Version Stamps on Multiple Nodes ● If there is a single master, everything works well ● Peer-to-peer replication: ○ Any peer can process update ○ When contacted for update, the peer must reply to the client immediately after storing the new value ■ it cannot wait until all peers commit the update (2-phase commit protocol) ● Objective: A distributed algorithm that would ○ reliably detect write-write conflict ○ balance between write performance and conflict prevention ■ or allow the user to balance it 29 Vector Stamps Algorithms Vector stamps ○ Family of algorithms for generating a partial ordering of events in a distributed system and detecting “conflicts”. ○ Each node has its own counter ■ for each data item ■ The node’s counter increments when its value is updated ○ Each node keeps a counter vector with counters of all nodes ■ The nodes exchange their values ○ Each node uses the counter vectors to determine ■ which value is new ■ if there is a conflict 30 (key, value2) [blue: 2, green: 1, red: 1] (key, value3) [blue: 2, green: 2, red: 1] (key, value3) [blue: 2, green: 2, red: 1] (key, value5) [blue: 2, green: 2, red: 2] (key, value1) [blue: 1, green: 1, red: 1] (key, value2) [blue: 2, green: 1, red: 1] (key, value2) [blue: 2, green: 1, red: 1] (key, value3) [blue: 2, green: 2, red: 1] (key, value3) [blue: 2, green: 2, red: 1] (key, value2) [blue: 2, green: 1, red: 1] (key, value3) [blue: 2, green: 2, red: 1] (key, value3) [blue: 2, green: 2, red: 1] (key, value4) [blue: 3, green: 2, red: 1] (key, value1) [blue: 1, green: 1, red: 1] (key, value2) [blue: 2, green: 1, red: 1] CONFLICT (key, value1) [blue: 1, green: 1, red: 1] Vector Stamps: Example three nodes, initial state: (key, value1) [blue: 1, green: 1, red: 1] blue red green ● single update, then sync - the stamp order is clear ○ [blue: 1, green: 1, red: 1] is older than [blue: 2, green: 1, red: 1] SYNC ● next update on green ● two simultaneous updates ○ [blue: 3, green: 2, red: 1] cannot be compared to [blue: 2, green: 2, red: 2] 31 Vector Stamps: Specific Techniques ● Specific techniques differ in the way they communicate ● Widely used techniques (and their variants) ○ Lamport timestamps (1987) ○ Vector clocks (used by Dynamo, etc.) ■ counters updated whenever nodes communicate ■ last value can be retrieved only during reads ○ Version vectors ○ Matrix clocks ○ ... 32 Conflict Resolution ● There are three general ways to resolve conflicts ○ (reconcile differences between copies of distributed data) ○ this process is often known as anti-entropy 1. Write repair ○ The correction takes place during a write operation 2. Read repair ○ The correction is done when a read finds an inconsistency ■ Optimistic strategy, read operation is slowed down 3. Asynchronous repair ○ The correction is done as separate operations ○ AKA active “anti-entropy” 33 Gossip Protocols A set of distributed protocols ● Each node periodically sends its current info ○ To a randomly-selected peer ○ The peers keep the newer info In distributed NoSQL databases, gossip is used for ● Spreading information about current state ○ of the entering/leaving/failing nodes ○ asynchronous reconciling of conflicts (anti-entropy) ○ other properties, ... 34 Distributed Transactions Key Techniques of Key-value Stores Transactions Transaction = a sequence of atomic operations that form one logical operation on the database. ● Some of the distributed key-value stores enable full transactional processing ● The following techniques are key: ○ Two-phase commit protocol (2PC) ■ Atomicity of transaction: either all operations (commit) or none (rollback) ○ Multi-version concurrency control (MVCC) ■ Levels of isolation of transactions 36 Levels of Isolation (1) ● The transactions should be “isolated” ○ Isolation: property that defines how/when results of one operation become visible to other concurrent operations. ● We recognize four levels of isolation ○ We talk about different "read phenomena" (see below) 37 1. READ UNCOMMITTED: Operation can access uncommitted changes made by other transactions. ○ It suffers phenomenon "dirty read" - a transaction reads uncommitted values that is later rolled back. Levels of Isolation (2) 2. READ COMMITTED: If one transaction commits a value, other transactions will read it immediately. ○ It suffers phenomenon "non-repeatable read" - if a transaction reads the same record twice, the second read has a different result. 38 3. REPEATABLE READS: Multiple reads of the same record/key issued within the same transaction will always return the same value. ○ It suffers phenomenon "phantom read" - if a transaction does two identical "range queries", and the collection of rows returned by the second query differs from the first. Levels of Isolation (3) 4. SERIALIZABLE: All transactions occur in a completely isolated fashion as if executed serially. ○ None of the read phenomena may occur. 39 Levels of Isolation (4) source: http://en.wikipedia.org/wiki/Isolation_(database_systems) Isolation level Dirty reads Non-repeatable reads Phantoms Read Uncommitted may occur may occur may occur Read Committed - may occur may occur Repeatable Read - - may occur Serializable - - - 40 Snapshot isolation is still not the serializable level! Assume two records in a table black and white. T1 changes all blacks into whites and T2 vice versa. If you run them simultanously, in snapshot isolation you will end up with swapped colors. In serializable, you will have either all blacks or all whites! Multi-version Concurrency Control ● Multi-version Concurrency Control (MVCC) ○ a technique to solve concurrent access to data ○ faster than strict use of r/w locks ○ popular in many (RDBMS) databases ○ if one transaction is writing and the other is reading, the system can create another version of the data ■ each transaction sees a snapshot of the data at a particular instant in time source: http://en.wikipedia.org/wiki/Multiversion_concurrency_control 41 MVCC: Example source: http://www.slideshare.net/quipo/nosql-databases-why-what-and-when 42 MVCC and Isolation Levels ● MVCC cannot ensure full SERIALIZABILITY ○ skew write (write skew anomaly) ■ two transactions concurrently read an overlapping data set, concurrently make disjoint updates and finally concurrently commit, neither having seen the update performed by the other ● For instance, in Infinispan user can choose ○ READ_UNCOMMITED ■ don’t use transactions at all ○ READ_COMMITED (default) ○ REPEATABLE_READS ■ using some version of JBoss MVCCEntry source: http://en.wikipedia.org/wiki/Snapshot_isolation http://infinispan.org/docs/7.0.x/user_guide/user_guide.html 43 Two-phase Commit Protocol ● 2PC: Distributed algorithm ○ coordinating all participants in a distributed transaction ○ on whether to commit or abort (roll back) the transaction ■ it’s a special type of consensus protocol source: http://en.wikipedia.org/wiki/2PC http://www.slideshare.net/quipo/nosql-databases-why-what-and-when coordinator participants 1. Commit request phase (voting phase) query to commit coordinator 1. Execute transactions up to COMMIT phase 2. Write entry to UNDO and REDO logs Agree (YES) or Abort (NO) 2. Commit phase commit 1. Complete operation 2. Release locks acknowledge a. SUCCESS (agreement from all) b. FAILURE (abort from any) complete transaction roll back 1. Undo operation 2. Release locks undo transaction 44 Comparison of K-V Stores & Applicability K-V Stores: Suitable Use Cases ● Storing Web Session Information ○ Every web session is assigned a unique session_id value ○ Everything about the session can be stored by a single PUT request or retrieved using a single GET ○ Fast, everything is stored in a single object ● User Profiles, Preferences ○ Every user has a unique user_id/user_name + preferences (language, time zone, design, access rights, … ) ○ As in the previous case: Fast, single object, single GET/PUT ● Shopping Cart Data ○ Similar to the previous cases 46 K-V Stores: When Not to Use ● Relationships among Data ○ Relationships between different sets of data ■ Some key-value stores provide link-walking features ● Multi-operation Transactions ○ Saving multiple keys ■ Failure to save any of them → revert or roll back the rest of the operations ● Query by Data ○ Search the keys based on something found in the value part ■ Additional indexes needed (some stores provide them) ● Operations by Key Sets ○ Operations are limited to one key at a time ■ No way to operate upon multiple keys at the same time 47 K-V Stores: Features & Differences Dozens of key-value stores - how to choose? 1. Basic information ○ programming language, license etc. 2. Internal Features ○ how are certain principles implemented ○ which influences performance/security/reliability/etc. 3. Advanced (User-visible) Features ○ what “advanced” features does the store provide ■ besides store/get/delete operations 48 Basic Information ● Developer ○ important information, if there is a company behind ● Initial release date ○ is it a hot new (shaky) technology or more stable one? ● License ○ Open Source - GNU GPL, Apache, supported version? ● Implementation language ○ most often: C/C++, Java, Erlang ● Server operating systems ○ usually: Linux, BSD (OS X, MS Windows) 49 Internal Features (1) ● Durability ○ if the system supports data persistence ○ and how is it done (storage models) ● Data Partitioning (Sharding) ○ if the system supports (semi)-automatic data sharding ● Data Replication ○ replication can speed up read/write and provide reliability ○ master-slave, P2P, R/W quora, version control, etc. 50 Internal Features (2) ● Concurrency control ○ does the system allow concurrent accesses ○ and how is it managed (solved conflicts) ● Cluster topology management ○ is there a centralized repository of participating nodes ○ or some Gossip protocol ● Node/communication failure management ○ how fault-tolerant is the system ○ permanent failure recovery ● User concept ○ any support for user-based access control 51 Advanced Features (1) ● Data model ○ Is the “value” really an unstructured BLOB ○ or are some advanced structures supported ■ e.g. Redis: strings, hashes, lists, sets and sorted sets ● Secondary Indexes ○ for efficient access of the data by the values ○ e.g. interval indexes, Lucene-like indexes for full-text search ● Foreign keys ○ or other links between data (keys and values) 52 Advanced Features (2) ● Distributed Transactions Management ○ is implemented any concept of transaction management ○ like X/Open XA (eXtended Architecture) ● Map-Reduce processing ○ is available some distributed operation execution like M-R ● Triggers ○ procedures started automatically when something happens 53 Communication Modes There are three basic communication modes 1. Via some web-service interface ○ HTTP REST service, SOAP ○ usually fast, callable from many clients/libraries/languages 2. Specific language connector ○ library in the language of my application ○ may be slower but comfortable 3. Embedded to my application ○ the database system runs within the application process ○ requires compatible (the same) programming language 54 References ● I. Holubová, J. Kosek, K. Minařík, D. Novák. Big Data a NoSQL databáze. Praha: Grada Publishing, 2015. 288 p. ● Sadalage, P. J., & Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Addison-Wesley Professional, 192 p. ● doc. RNDr. Irena Holubova, Ph.D. MMF UK course NDBI040: Big Data Management and NoSQL Databases 55