Apache Hadoop Ecosystem

• Apache Hadoop project – inspired by Google's MapReduce and Google File System papers
• Open-source, flexible, and available architecture for large-scale computation and data processing on a network of commodity hardware
• Open-source software + commodity hardware = reduced IT costs

Hadoop Core
• HDFS
• MapReduce

HDFS
• Hadoop Distributed File System
• Redundant
• Fault tolerant
• Scalable
• Self-healing
• Write once, read many times
• Java API
• Command-line tool

MapReduce
• Two phases drawn from functional programming: map and reduce
• Redundant
• Fault tolerant
• Scalable
• Self-healing
• Java API

Hadoop Core
[diagram: applications access both HDFS and MapReduce through Java APIs]

Word Count Example
• Map input – key: offset, value: line
• Map output – key: word, value: count
• Reduce output – key: word, value: sum of counts

Why do these tools exist?
• MapReduce is very powerful, but it can be awkward to master
• These tools allow programmers who are familiar with other programming styles to take advantage of the power of MapReduce

The Ecosystem is the System
• Hadoop has become the kernel of the distributed operating system for Big Data
• No one uses the kernel alone
• Hadoop Ecosystem = a collection of projects at Apache:
– MapReduce runtime (distributed programming framework)
– Hadoop Distributed File System (HDFS)
– ZooKeeper (coordination)
– HBase (column NoSQL DB)
– Sqoop/Flume (data integration)
– Oozie (job workflow & scheduling)
– Pig/Hive (analytical languages)
– Hue (web console)
– Mahout (data mining)

ZooKeeper – Coordination Framework

What is ZooKeeper?
• A centralized service for maintaining
– Configuration information
– Naming
– Distributed synchronization
– etc.
• A set of tools to build distributed applications that can safely handle partial failures
– Applications don't need to implement failure handling on their own
• ZooKeeper was designed to store coordination data
– Status information
– Configuration
– Location information

Flume / Sqoop – Data Integration Framework

What is the problem with data collection?
• Collecting data (to analyze) from distributed systems requires many mechanisms to provide reliability, security, …
– These shouldn't have to be implemented by each application
• Apache Flume – a system for moving massive quantities of streaming data into HDFS
– e.g., collecting the log data present in web servers' log files and aggregating it in HDFS for analysis

How can Flume help?
• A distributed data collection service
• Efficiently collects, aggregates, and moves large amounts of data
• Supports data encryption (SSL/TLS)
• Fault tolerant, with many failover and recovery mechanisms
• A one-stop solution for collecting data of all formats

Flume: High-Level Overview
• Logical node
• Source
• Sink
• Channel (passive store)
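These pieces are wired together in an agent's configuration file. Below is a minimal sketch of a Flume NG properties file; the agent, source, channel, and sink names (agent1, src1, ch1, sink1) are hypothetical, and the netcat source and HDFS path are illustrative only:

agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: listen for events on a local TCP port (illustrative)
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# Channel: passive in-memory store between source and sink
agent1.channels.ch1.type = memory

# Sink: drain the channel into HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events
agent1.sinks.sink1.channel = ch1

An agent started with this file (e.g., flume-ng agent --name agent1 --conf-file flume.conf) pulls events from the source through the channel and into HDFS.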
Flume Architecture
[diagram: many log sources feed Flume nodes, which aggregate the data into HDFS]

Flume Sources and Sinks
• Local files
• HDFS
• Stdin, stdout
• Twitter
• IRC
• IMAP
• NetCat UDP source
• Syslog
• Custom sources
• …

Sqoop
• Apache Sqoop (= SQL + Hadoop)
– A tool for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
– Easy, parallel database import/export
– What can you do with it?
• Import data from an RDBMS into HDFS
• Export data from HDFS back into an RDBMS

Sqoop
[diagram: RDBMS ↔ Sqoop ↔ HDFS]

Sqoop Examples

$ sqoop import --connect jdbc:mysql://localhost/world --username root --table City
...
$ hadoop fs -cat City/part-m-00000
1,Kabul,AFG,Kabol,1780000
2,Qandahar,AFG,Qandahar,237500
3,Herat,AFG,Herat,186800
4,Mazar-e-Sharif,AFG,Balkh,127800
5,Amsterdam,NLD,Noord-Holland,731200
...

Pig / Hive – Analytical Languages

Why Hive and Pig?
• Although MapReduce is very powerful, it can also be complex to master
• Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code
• Many organizations have programmers who are skilled at writing code in scripting languages
• Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce
– Hive was initially developed at Facebook, Pig at Yahoo!

Hive
• An SQL-like interface to Hadoop, initially developed at Facebook
– Data warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop
– MapReduce for execution
– HDFS for storage
• Hive Query Language
– Basic SQL: SELECT, FROM, JOIN, GROUP BY
– Equi-joins, multi-table insert, multi-group-by
– Batch queries

SELECT storeid, COUNT(*) FROM purchases WHERE price > 100 GROUP BY storeid

Hive
[diagram: SQL queries enter Hive, which compiles them to MapReduce]

Apache Pig
• A high-level scripting language (Pig Latin), initiated by Yahoo!
• Makes it simple to write MapReduce programs
– Their structure is amenable to substantial parallelization
• Built for large-scale data processing
– Optimizes automatically
– Allows the user to focus on semantics rather than efficiency
• Crucial features
– Easy to understand, easy to debug
– Extensible (new functions)
– Allows for heavy optimization

A = LOAD 'a.txt' AS (id, name, age, ...);
B = LOAD 'b.txt' AS (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C INTO 'c.txt';

Pig
[diagram: a Pig Latin script enters Pig, which compiles it to MapReduce]
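For comparison, the same join could be written in HiveQL roughly as follows (a sketch, assuming tables a and b have already been created in the Hive metastore):

SELECT *
FROM a JOIN b ON (a.id = b.id);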
Hive vs. Pig
• Language – Hive: HiveQL (SQL-like); Pig: Pig Latin, a scripting language
• Schema – Hive: table definitions stored in a metastore; Pig: a schema is optionally defined at runtime
• Programmatic access – Hive: JDBC, ODBC; Pig: PigServer

WordCount Example
• Input:
Hello World Bye World
Hello Hadoop Goodbye Hadoop
• For the given sample input, the map emits:
<Hello, 1> <World, 1> <Bye, 1> <World, 1>
<Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
• The reduce just sums up the values:
<Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>

WordCount Example in MapReduce

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emits <word, 1> for every token in the input line
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Sums up the counts emitted for each word
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);  // so the cluster can locate the job jar
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}

WordCount Example by Pig

A = LOAD 'wordcount/input' USING PigStorage() AS (token:chararray);
B = GROUP A BY token;
C = FOREACH B GENERATE group, COUNT(A) AS count;
DUMP C;

WordCount Example by Hive

CREATE TABLE wordcount (token STRING);
LOAD DATA LOCAL INPATH 'wordcount/input' OVERWRITE INTO TABLE wordcount;
SELECT token, count(*) FROM wordcount GROUP BY token;
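All three versions run as MapReduce jobs under the hood. As a hedged sketch, they might be launched from the command line like this (the jar and script file names are hypothetical):

$ hadoop jar wordcount.jar WordCount wordcount/input wordcount/output
$ pig wordcount.pig
$ hive -f wordcount.hql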
The Story So Far…
[diagram: RDBMS (SQL) connected by Sqoop to Hadoop; Hive (SQL) and Pig (script) layered over MapReduce (Java); HDFS exposing a POSIX-like FS; Flume feeding data in]

HBase – Column NoSQL DB

Structured data vs. raw data

Apache HBase
• Open-source Apache project
• A non-relational, distributed database
• Runs on top of HDFS
• Modeled after Google's BigTable technology
• Written in Java
• A NoSQL (Not Only SQL) database
• Consistent and partition tolerant
• Runs on commodity hardware
• Able to host very large databases (terabytes to petabytes)
– Billions of rows and millions of columns atop clusters of commodity hardware
• Low-latency random reads/writes to HDFS
• Many companies are using HBase
– Facebook, Twitter, Adobe, Mozilla, Yahoo!, Trend Micro, and StumbleUpon

Apache HBase is NOT
• A direct replacement for an RDBMS
• ACID (Atomicity, Consistency, Isolation, Durability) compliant
– HBase provides row-level atomicity
– A scan is NOT a consistent view of a table (nor an isolated one)
– All visible data is, however, durable data

Relational Database vs. HBase
• Hardware
– RDBMS: expensive enterprise multiprocessor systems
– HBase: the same as Hadoop (commodity)
• Fault tolerance
– RDBMS: configured for high availability; server downtime is intolerable
– HBase: built into the architecture; an individual node failure does not impact overall performance
• Database size
– An RDBMS can hold up to terabytes
– HBase can hold petabytes
• Data layout
– RDBMS: row-and-column oriented
– HBase: column oriented
• Data types
– RDBMS: rich data types
– HBase: bytes
• Transactions
– RDBMS: fully ACID compliant
– HBase: ACID on a single row only
• Indexes
– RDBMS: primary keys, foreign keys, and other indexes
– HBase: the sorted row key (not a real index)

HBase – Workflow
[diagram]

HBase – Fault Tolerance
• What if a region server dies?
– The HBase master will assign its regions to a new region server
• What if the master dies?
– The backup master will take over
• What if the backup master dies?
– You are dead
• Replication of data
– HBase achieves this using the HDFS replication mechanism
• Failure detection
– ZooKeeper is used to identify failed region servers
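Before looking at the data model in detail, a minimal sketch of the HBase Java client illustrates the row key / column family / qualifier addressing scheme. The API shown is the classic HTable-era client; the table, family, and column names are hypothetical (chosen to match the shell example later in this section):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");  // hypothetical table
    // Write one cell, addressed by row key + column family + qualifier
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("mycf"), Bytes.toBytes("col1"), Bytes.toBytes("val1"));
    table.put(put);
    // Random read of the same cell by row key
    Get get = new Get(Bytes.toBytes("row1"));
    Result result = table.get(get);
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("mycf"), Bytes.toBytes("col1"))));
    table.close();
  }
}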
HBase – Data Model
• No schema
• Table
– The row key must be unique
– Rows are formed by one or more columns
– Columns are grouped into column families
– Column families must be defined at table creation time
– Any number of columns per column family
– Columns can be added on the fly
– Columns can be NULL
• NULL columns are NOT stored (free of cost)
• A column only exists once inserted (sparse)
• Cell
– Addressed by row key, column family, qualifier, and timestamp/version
• Data is represented as byte arrays
– Table name, column family name, column name
• Cells are "versioned"
• Table rows are sorted by row key
• Region – a row range [start-key : end-key]

HBase – Logical View of Data

RDBMS view:
ID (pk) | First Name | Last Name | tweet | Timestamp
1234    | John       | Smith     | hello | 20130710
5678    | Joe        | Brown     | xyz   | 20120825
5678    | Joe        | Brown     | zzz   | 20130916

Logical HBase view:
Row key | Value (column family, qualifier, version)
1234    | info{'lastName': 'Smith', 'firstName': 'John'}, tweet{'msg': 'hello' @ts 20130710}
5678    | info{'lastName': 'Brown', 'firstName': 'Joe'}, tweet{'msg': 'xyz' @ts 20120825, 'msg': 'zzz' @ts 20130916}

HBase – Physical View of Data

info column family:
Row key | Column Family:Column | Timestamp | Value
1234    | info:firstName       | 12345678  | John
1234    | info:lastName        | 12345678  | Smith
5678    | info:firstName       | 12345679  | Joe
5678    | info:lastName        | 12345679  | Brown

tweet column family:
Row key | Column Family:Column | Timestamp | Value
1234    | tweet:msg            | 12345678  | hello
5678    | tweet:msg            | 12345679  | xyz
5678    | tweet:msg            | 12345999  | zzz

KEY (ROW KEY, CF, QUALIFIER, TIMESTAMP) => VALUE

HBase – Logical to Physical View

Logical view (sparse table; CF1 covers columns C1–C4, CF2 covers C5–C7):
Row  | C1  | C2 | C3  | C4 | C5 | C6 | C7
ROW1 | V1  |    | V3  |    |    | V6 |
ROW2 | V4  | V6 |     | V7 |    |    |
ROW3 |     |    | V6  |    |    | V5 |
ROW4 | V10 |    | V11 |    |    | V2 |

Physical view, HFile for CF1:
ROW1:CF1:C1:V1
ROW1:CF1:C3:V3
ROW2:CF1:C1:V4
ROW2:CF1:C2:V6
ROW2:CF1:C4:V7
ROW3:CF1:C3:V6
ROW4:CF1:C1:V10
ROW4:CF1:C3:V11

Physical view, HFile for CF2:
ROW1:CF2:C6:V6
ROW3:CF2:C6:V5
ROW4:CF2:C6:V2

HBase DB Design Considerations
• Row key design
– To leverage HBase, row key design is very important
– The row key must be designed based on how you access the data
– Salting the row key (adding a prefix)
– Must be designed so that data is uniformly distributed, to avoid hotspotting
• Column family design
– Design based on groupings of like information (e.g., user base info vs. user tweets)
– Use short column family names (every cell in the HFile stores the name, in bytes)
– Two to three column families per table

HBase Examples

hbase> create 'mytable', 'mycf'
hbase> list
hbase> put 'mytable', 'row1', 'mycf:col1', 'val1'
hbase> put 'mytable', 'row1', 'mycf:col2', 'val2'
hbase> put 'mytable', 'row2', 'mycf:col1', 'val3'
hbase> scan 'mytable'
hbase> disable 'mytable'
hbase> drop 'mytable'

Oozie – Job Workflow & Scheduling

What is Apache Oozie?
• A Java web application
– A scalable, reliable, and extensible system
• Oozie is a workflow scheduler for Hadoop
– i.e., cron for Hadoop
• Workflow jobs are Directed Acyclic Graphs (DAGs) of actions
• Triggered by
– Time (frequency)
– Data (availability)
[diagram: an example workflow DAG of five jobs]
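Oozie workflows are defined in XML. A minimal sketch of such a definition follows; the workflow and action names are hypothetical, and the body of the map-reduce action (job tracker, name node, and job configuration) is elided:

<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="wordcount"/>
  <action name="wordcount">
    <map-reduce>
      <!-- job-tracker, name-node, and job configuration go here -->
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>WordCount failed</message>
  </kill>
  <end name="end"/>
</workflow-app>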
Oozie Features
• Execute and monitor workflows in Hadoop
• Periodic scheduling of workflows
• Trigger execution on data availability
• HTTP and command-line interfaces, plus a web console
• Component independent
– MapReduce
– Hive
– Pig
– Sqoop
– Streaming
– HDFS
– Sub-workflows
– Java (custom Java code)

Mahout – Data Mining

What is Apache Mahout?
• A machine-learning tool
• Implements distributed and scalable machine-learning techniques/algorithms on the Hadoop platform
– Recommendation
– Classification
– Clustering
• Makes it easier and faster to build intelligent applications

Mahout Features
• Algorithms are written on top of Hadoop
– They work well in a distributed environment
• Offers a ready-to-use framework for data mining tasks on large volumes of data
• Lets applications analyze large sets of data effectively and quickly
• Includes several MapReduce-enabled clustering implementations
– Such as k-means, fuzzy k-means, Canopy, Dirichlet, and Mean-Shift
• Supports Distributed Naive Bayes and Complementary Naive Bayes classification implementations
• Comes with distributed fitness function capabilities for evolutionary programming

Mahout Use Cases
• Yahoo!: spam detection
• Foursquare: recommendations
• SpeedDate.com: recommendations
• Adobe: user targeting
• Amazon: personalization platform
• Twitter: user interest modeling

Use Case Example
• Predict what a user likes based on
– His/her historical behavior
– The aggregate behavior of people similar to him/her

Conclusion
During the last two lessons, we introduced:
• Why Hadoop is needed
• The basic concepts of HDFS and MapReduce
• What sorts of problems can be solved with Hadoop
• What other projects are included in the Hadoop ecosystem