Apache Hadoop Ecosystem

• Apache Hadoop project – inspired by Google's MapReduce and Google File System papers
• Open-source, flexible, and available architecture for large-scale computation and data processing on a network of commodity hardware
• Open-source software + commodity hardware = reduced IT costs

Hadoop Core
• HDFS
• MapReduce

HDFS
• Hadoop Distributed File System
• Redundant
• Fault tolerant
• Scalable
• Self-healing
• Write once, read many times
• Java API
• Command-line tool

MapReduce
• Two phases drawn from functional programming: map and reduce
• Redundant
• Fault tolerant
• Scalable
• Self-healing
• Java API

Hadoop Core
[diagram: applications access both HDFS and MapReduce through Java APIs]

Word Count Example
• Map input – key: offset, value: line
• Map output – key: word, value: count
• Reduce output – key: word, value: sum of counts

Why do these tools exist?
• MapReduce is very powerful, but it can be awkward to master
• These tools allow programmers who are familiar with other programming styles to take advantage of the power of MapReduce

The Ecosystem is the System
• Hadoop has become the kernel of the distributed operating system for Big Data
• No one uses the kernel alone
• Hadoop Ecosystem = a collection of projects at Apache:
– MapReduce runtime (distributed programming framework)
– Hadoop Distributed File System (HDFS)
– ZooKeeper (coordination)
– HBase (column NoSQL DB)
– Sqoop/Flume (data integration)
– Oozie (job workflow & scheduling)
– Pig/Hive (analytical languages)
– Hue (web console)
– Mahout (data mining)

ZooKeeper – Coordination Framework

What is ZooKeeper?
• A centralized service for maintaining
– Configuration information
– Naming
– Distributed synchronization
– etc.
• A set of tools to build distributed applications that can safely handle partial failures
– Applications don't need to implement failure handling on their own
• ZooKeeper was designed to store coordination data
– Status information
– Configuration
– Location information

Flume / Sqoop – Data Integration Framework

What is the problem with data collection?
• Collecting data (to analyze) from distributed systems requires many mechanisms to provide reliability, security, …
– These shouldn't have to be implemented by each application
• Apache Flume – a system for moving massive quantities of streaming data into HDFS
– e.g., collecting the log data present in web servers' log files and aggregating it in HDFS for analysis

How can Flume help?
• A distributed data collection service
• Efficiently collects, aggregates, and moves large amounts of data
• Supports data encryption (SSL/TLS)
• Fault tolerant, with many failover and recovery mechanisms
• A one-stop solution for collecting data of all formats

Flume: High-Level Overview
• Logical node
• Source
• Sink
• Channel (passive store)
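These pieces are wired together in an agent's configuration file. Below is a minimal sketch of a Flume NG properties file; the agent, source, channel, and sink names (agent1, src1, ch1, sink1) are hypothetical, and the netcat source and HDFS path are illustrative only:

agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: listen for events on a local TCP port (illustrative)
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# Channel: passive in-memory store between source and sink
agent1.channels.ch1.type = memory

# Sink: drain the channel into HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events
agent1.sinks.sink1.channel = ch1

An agent started with this file (e.g., flume-ng agent --name agent1 --conf-file flume.conf) pulls events from the source through the channel and into HDFS.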
Flume Architecture
[diagram: many log sources feed Flume nodes, which aggregate the data into HDFS]

Flume Sources and Sinks
• Local files
• HDFS
• Stdin, stdout
• Twitter
• IRC
• IMAP
• NetCat UDP source
• Syslog
• Custom sources
• …

Sqoop
• Apache Sqoop (= SQL + Hadoop)
– A tool for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
– Easy, parallel database import/export
– What can you do with it?
• Import data from an RDBMS into HDFS
• Export data from HDFS back into an RDBMS

Sqoop
[diagram: RDBMS ↔ Sqoop ↔ HDFS]

Sqoop Examples

$ sqoop import --connect jdbc:mysql://localhost/world --username root --table City
...
$ hadoop fs -cat City/part-m-00000
1,Kabul,AFG,Kabol,1780000
2,Qandahar,AFG,Qandahar,237500
3,Herat,AFG,Herat,186800
4,Mazar-e-Sharif,AFG,Balkh,127800
5,Amsterdam,NLD,Noord-Holland,731200
...

Pig / Hive – Analytical Languages

Why Hive and Pig?
• Although MapReduce is very powerful, it can also be complex to master
• Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code
• Many organizations have programmers who are skilled at writing code in scripting languages
• Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce
– Hive was initially developed at Facebook, Pig at Yahoo!

Hive
• An SQL-like interface to Hadoop, initially developed at Facebook
– Data warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop
– MapReduce for execution
– HDFS for storage
• Hive Query Language
– Basic SQL: SELECT, FROM, JOIN, GROUP BY
– Equi-joins, multi-table insert, multi-group-by
– Batch queries

SELECT storeid, COUNT(*) FROM purchases WHERE price > 100 GROUP BY storeid

Hive
[diagram: SQL queries enter Hive, which compiles them to MapReduce]

Apache Pig
• A high-level scripting language (Pig Latin), initiated by Yahoo!
• Makes it simple to write MapReduce programs
– Their structure is amenable to substantial parallelization
• Built for large-scale data processing
– Optimizes automatically
– Allows the user to focus on semantics rather than efficiency
• Crucial features
– Easy to understand, easy to debug
– Extensible (new functions)
– Allows for heavy optimization

A = LOAD 'a.txt' AS (id, name, age, ...);
B = LOAD 'b.txt' AS (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C INTO 'c.txt';

Pig
[diagram: a Pig Latin script enters Pig, which compiles it to MapReduce]
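For comparison, the same join could be written in HiveQL roughly as follows (a sketch, assuming tables a and b have already been created in the Hive metastore):

SELECT *
FROM a JOIN b ON (a.id = b.id);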
Hive vs. Pig
• Language – Hive: HiveQL (SQL-like); Pig: Pig Latin, a scripting language
• Schema – Hive: table definitions stored in a metastore; Pig: a schema is optionally defined at runtime
• Programmatic access – Hive: JDBC, ODBC; Pig: PigServer

WordCount Example
• Input:
Hello World Bye World
Hello Hadoop Goodbye Hadoop
• For the given sample input, the map emits:
<Hello, 1> <World, 1> <Bye, 1> <World, 1>
<Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
• The reduce just sums up the values:
<Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>

WordCount Example in MapReduce

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emits <word, 1> for every token in the input line
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Sums up the counts emitted for each word
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);  // so the cluster can locate the job jar
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}

WordCount Example by Pig

A = LOAD 'wordcount/input' USING PigStorage() AS (token:chararray);
B = GROUP A BY token;
C = FOREACH B GENERATE group, COUNT(A) AS count;
DUMP C;

WordCount Example by Hive

CREATE TABLE wordcount (token STRING);
LOAD DATA LOCAL INPATH 'wordcount/input' OVERWRITE INTO TABLE wordcount;
SELECT token, count(*) FROM wordcount GROUP BY token;
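All three versions run as MapReduce jobs under the hood. As a hedged sketch, they might be launched from the command line like this (the jar and script file names are hypothetical):

$ hadoop jar wordcount.jar WordCount wordcount/input wordcount/output
$ pig wordcount.pig
$ hive -f wordcount.hql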
The Story So Far…
[diagram: RDBMS (SQL) connected by Sqoop to Hadoop; Hive (SQL) and Pig (script) layered over MapReduce (Java); HDFS exposing a POSIX-like FS; Flume feeding data in]

HBase – Column NoSQL DB

Structured data vs. raw data

Apache HBase
• Open-source Apache project
• A non-relational, distributed database
• Runs on top of HDFS
• Modeled after Google's BigTable technology
• Written in Java
• A NoSQL (Not Only SQL) database
• Consistent and partition tolerant
• Runs on commodity hardware
• Able to host very large databases (terabytes to petabytes)
– Billions of rows and millions of columns atop clusters of commodity hardware
• Low-latency random reads/writes to HDFS
• Many companies are using HBase
– Facebook, Twitter, Adobe, Mozilla, Yahoo!, Trend Micro, and StumbleUpon

Apache HBase is NOT
• A direct replacement for an RDBMS
• ACID (Atomicity, Consistency, Isolation, Durability) compliant
– HBase provides row-level atomicity
– A scan is NOT a consistent view of a table (nor an isolated one)
– All visible data is, however, durable data

Relational Database vs. HBase
• Hardware
– RDBMS: expensive enterprise multiprocessor systems
– HBase: the same as Hadoop (commodity)
• Fault tolerance
– RDBMS: configured for high availability; server downtime is intolerable
– HBase: built into the architecture; an individual node failure does not impact overall performance
• Database size
– An RDBMS can hold up to terabytes
– HBase can hold petabytes
• Data layout
– RDBMS: row-and-column oriented
– HBase: column oriented
• Data types
– RDBMS: rich data types
– HBase: bytes
• Transactions
– RDBMS: fully ACID compliant
– HBase: ACID on a single row only
• Indexes
– RDBMS: primary keys, foreign keys, and other indexes
– HBase: the sorted row key (not a real index)

HBase – Workflow
[diagram]

HBase – Fault Tolerance
• What if a region server dies?
– The HBase master will assign its regions to a new region server
• What if the master dies?
– The backup master will take over
• What if the backup master dies?
– You are dead
• Replication of data
– HBase achieves this using the HDFS replication mechanism
• Failure detection
– ZooKeeper is used to identify failed region servers
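Before looking at the data model in detail, a minimal sketch of the HBase Java client illustrates the row key / column family / qualifier addressing scheme. The API shown is the classic HTable-era client; the table, family, and column names are hypothetical (chosen to match the shell example later in this section):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");  // hypothetical table
    // Write one cell, addressed by row key + column family + qualifier
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("mycf"), Bytes.toBytes("col1"), Bytes.toBytes("val1"));
    table.put(put);
    // Random read of the same cell by row key
    Get get = new Get(Bytes.toBytes("row1"));
    Result result = table.get(get);
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("mycf"), Bytes.toBytes("col1"))));
    table.close();
  }
}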
HBase – Data Model
• No schema
• Table
– The row key must be unique
– Rows are formed by one or more columns
– Columns are grouped into column families
– Column families must be defined at table creation time
– Any number of columns per column family
– Columns can be added on the fly
– Columns can be NULL
• NULL columns are NOT stored (free of cost)
• A column only exists once inserted (sparse)
• Cell
– Addressed by row key, column family, qualifier, and timestamp/version
• Data is represented as byte arrays
– Table name, column family name, column name
• Cells are "versioned"
• Table rows are sorted by row key
• Region – a row range [start-key : end-key]

HBase – Logical View of Data

RDBMS view:
ID (pk) | First Name | Last Name | tweet | Timestamp
1234    | John       | Smith     | hello | 20130710
5678    | Joe        | Brown     | xyz   | 20120825
5678    | Joe        | Brown     | zzz   | 20130916

Logical HBase view:
Row key | Value (column family, qualifier, version)
1234    | info{'lastName': 'Smith', 'firstName': 'John'}, tweet{'msg': 'hello' @ts 20130710}
5678    | info{'lastName': 'Brown', 'firstName': 'Joe'}, tweet{'msg': 'xyz' @ts 20120825, 'msg': 'zzz' @ts 20130916}

HBase – Physical View of Data

info column family:
Row key | Column Family:Column | Timestamp | Value
1234    | info:firstName       | 12345678  | John
1234    | info:lastName        | 12345678  | Smith
5678    | info:firstName       | 12345679  | Joe
5678    | info:lastName        | 12345679  | Brown

tweet column family:
Row key | Column Family:Column | Timestamp | Value
1234    | tweet:msg            | 12345678  | hello
5678    | tweet:msg            | 12345679  | xyz
5678    | tweet:msg            | 12345999  | zzz

KEY (ROW KEY, CF, QUALIFIER, TIMESTAMP) => VALUE

HBase – Logical to Physical View

Logical view (sparse table; CF1 covers columns C1–C4, CF2 covers C5–C7):
Row  | C1  | C2 | C3  | C4 | C5 | C6 | C7
ROW1 | V1  |    | V3  |    |    | V6 |
ROW2 | V4  | V6 |     | V7 |    |    |
ROW3 |     |    | V6  |    |    | V5 |
ROW4 | V10 |    | V11 |    |    | V2 |

Physical view, HFile for CF1:
ROW1:CF1:C1:V1
ROW1:CF1:C3:V3
ROW2:CF1:C1:V4
ROW2:CF1:C2:V6
ROW2:CF1:C4:V7
ROW3:CF1:C3:V6
ROW4:CF1:C1:V10
ROW4:CF1:C3:V11

Physical view, HFile for CF2:
ROW1:CF2:C6:V6
ROW3:CF2:C6:V5
ROW4:CF2:C6:V2

HBase DB Design Considerations
• Row key design
– To leverage HBase, row key design is very important
– The row key must be designed based on how you access the data
– Salting the row key (adding a prefix)
– Must be designed so that data is uniformly distributed, to avoid hotspotting
• Column family design
– Design based on groupings of like information (e.g., user base info vs. user tweets)
– Use short column family names (every cell in the HFile stores the name, in bytes)
– Two to three column families per table

HBase Examples

hbase> create 'mytable', 'mycf'
hbase> list
hbase> put 'mytable', 'row1', 'mycf:col1', 'val1'
hbase> put 'mytable', 'row1', 'mycf:col2', 'val2'
hbase> put 'mytable', 'row2', 'mycf:col1', 'val3'
hbase> scan 'mytable'
hbase> disable 'mytable'
hbase> drop 'mytable'

Oozie – Job Workflow & Scheduling

What is Apache Oozie?
• A Java web application
– A scalable, reliable, and extensible system
• Oozie is a workflow scheduler for Hadoop
– i.e., cron for Hadoop
• Workflow jobs are Directed Acyclic Graphs (DAGs) of actions
• Triggered by
– Time (frequency)
– Data (availability)
[diagram: an example workflow DAG of five jobs]
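Oozie workflows are defined in XML. A minimal sketch of such a definition follows; the workflow and action names are hypothetical, and the body of the map-reduce action (job tracker, name node, and job configuration) is elided:

<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="wordcount"/>
  <action name="wordcount">
    <map-reduce>
      <!-- job-tracker, name-node, and job configuration go here -->
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>WordCount failed</message>
  </kill>
  <end name="end"/>
</workflow-app>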
Oozie Features
• Execute and monitor workflows in Hadoop
• Periodic scheduling of workflows
• Trigger execution on data availability
• HTTP and command-line interfaces, plus a web console
• Component independent
– MapReduce
– Hive
– Pig
– Sqoop
– Streaming
– HDFS
– Sub-workflows
– Java (custom Java code)

Mahout – Data Mining

What is Apache Mahout?
• A machine-learning tool
• Implements distributed and scalable machine-learning techniques/algorithms on the Hadoop platform
– Recommendation
– Classification
– Clustering
• Makes it easier and faster to build intelligent applications

Mahout Features
• Algorithms are written on top of Hadoop
– They work well in a distributed environment
• Offers a ready-to-use framework for data mining tasks on large volumes of data
• Lets applications analyze large sets of data effectively and quickly
• Includes several MapReduce-enabled clustering implementations
– Such as k-means, fuzzy k-means, Canopy, Dirichlet, and Mean-Shift
• Supports Distributed Naive Bayes and Complementary Naive Bayes classification implementations
• Comes with distributed fitness function capabilities for evolutionary programming

Mahout Use Cases
• Yahoo!: spam detection
• Foursquare: recommendations
• SpeedDate.com: recommendations
• Adobe: user targeting
• Amazon: personalization platform
• Twitter: user interest modeling

Use Case Example
• Predict what a user likes based on
– His/her historical behavior
– The aggregate behavior of people similar to him/her

Conclusion
During the last two lessons, we introduced:
• Why Hadoop is needed
• The basic concepts of HDFS and MapReduce
• What sorts of problems can be solved with Hadoop
• What other projects are included in the Hadoop ecosystem