Apache Hadoop & others: general introduction
Open-Source Solution for Huge Data Sets
Tomáš Rebok, Masaryk University
Slides inspired by: Zheng Shao

What is Apache Hadoop?
⚫ Open-source software framework designed for storage and processing of large-scale data on clusters of commodity hardware
⚫ Created by Doug Cutting and Mike Cafarella in 2005
⚫ Cutting named the project after his son's toy elephant

Uses for Hadoop
⚫ Data-intensive text processing
⚫ Assembly of large genomes
⚫ Graph mining
⚫ Machine learning and data mining
⚫ Large-scale social network analysis

Who Uses Hadoop?

The Hadoop Ecosystem
⚫ Hadoop Common: contains libraries and other shared modules
⚫ Hadoop Distributed File System (HDFS): distributed storage
⚫ Hadoop YARN: Yet Another Resource Negotiator
⚫ Hadoop MapReduce: a programming model for large-scale data processing

Motivations for Hadoop
What considerations led to its design?
⚫ What were the limitations of earlier large-scale computing?
⚫ What requirements should an alternative approach have?
⚫ How does Hadoop address those requirements?

Early Large-Scale Computing
⚫ Historically, computation was processor-bound
⚫ Data volumes were relatively small
⚫ Complicated computations were performed on that data
⚫ Advances in computer technology have historically centered on improving the power of a single machine (e.g., the Cray-1)

Advances in CPUs
⚫ Moore's Law: the number of transistors on a dense integrated circuit doubles approximately every two years
⚫ Single-core computing cannot scale with current computing needs

Single-Core Limitation
⚫ Power consumption limits the speed increase we get from transistor density

Distributed Systems
⚫ Allow developers to use multiple machines for a single task

Distributed Systems: Problems
⚫ Programming on a distributed system is much more complex
  ⚫ synchronizing data exchanges
  ⚫ managing finite bandwidth
  ⚫ controlling computation timing is complicated
  ⚫ communication overhead
  ⚫ …

Distributed Systems: Problems
"You know you have a distributed system when the crash of a computer you've never heard of stops you from getting any work done." – Leslie Lamport
⚫ Distributed systems must be designed with the expectation of failure

Distributed Systems: Data Storage
⚫ Typically divided into Data Nodes and Compute Nodes
⚫ At compute time, data is copied to the Compute Nodes
⚫ Fine for relatively small amounts of data
⚫ Modern systems deal with far more data than was gathered in the past

How much data?
⚫ Facebook: 500 TB per day
⚫ Yahoo: over 170 PB
⚫ eBay: over 6 PB
⚫ Getting the data to the processors becomes the bottleneck
Requirements for Hadoop
⚫ Must support partial failure
⚫ Must be scalable

Partial Failures
⚫ Failure of a single component must not cause the failure of the entire system, only a degradation of application performance
⚫ Failure should not result in the loss of any data

Component Recovery
⚫ If a component fails, it should be able to recover without restarting the entire system
⚫ Component failure or recovery during a job must not affect the final output

Scalability
⚫ Increasing resources should increase load capacity
⚫ Increasing the load on the system should result in a graceful decline in performance for all jobs, not system failure

Hadoop
⚫ Based on the work done by Google in the early 2000s
  ⚫ "The Google File System" (2003)
  ⚫ "MapReduce: Simplified Data Processing on Large Clusters" (2004)
⚫ The core idea was to distribute the data as it is initially stored
⚫ Each node can then perform computation on the data it stores, without moving the data for the initial processing

Core Hadoop Concepts
⚫ Applications are written in a high-level programming language
  ⚫ no network programming or temporal dependencies
⚫ Nodes should communicate as little as possible
  ⚫ a "shared nothing" architecture
⚫ Data is spread among the machines in advance
  ⚫ computation is performed where the data is already stored as often as possible

High-Level Overview
⚫ When data is loaded onto the system, it is divided into blocks
  ⚫ typically 64 MB or 128 MB
⚫ Tasks are divided into two phases
  ⚫ Map tasks, which run on small portions of data where the data is stored
  ⚫ Reduce tasks, which combine data to produce the final output
⚫ A master program allocates work to individual nodes

Fault Tolerance
⚫ Failures are detected by the master program, which reassigns the work to a different node
⚫ Restarting a task does not affect the nodes working on other portions of the data
⚫ If a failed node restarts, it is added back to the system and assigned new tasks
⚫ The master can redundantly execute the same task to avoid slow-running nodes

Hadoop Overview
⚫ Open-source Apache project
  ⚫ http://hadoop.apache.org/
  ⚫ book: http://oreilly.com/catalog/9780596521998/index.html
⚫ Written in Java
  ⚫ works with other languages as well
⚫ Runs on
  ⚫ Linux, Windows and more
  ⚫ commodity hardware with a high failure rate

Current Status of Hadoop
⚫ Largest cluster (Facebook)
  ⚫ 2000 nodes (1200 nodes with 8 cores + 800 nodes with 16 cores)
  ⚫ 21 PB HDFS storage (12 TB per node)
  ⚫ 32 GB of RAM per node
⚫ Used by 40+ companies and universities around the world
  ⚫ Yahoo, Facebook, etc.
⚫ Cloud Computing Donation from Google and IBM
⚫ Cloudera: a startup focused on providing services for Hadoop

Hadoop Components
⚫ Hadoop Distributed File System (HDFS)
⚫ Hadoop Map-Reduce
⚫ Contrib (contributed components)
  ⚫ Hadoop Streaming
  ⚫ Pig / JAQL / Hive
  ⚫ HBase
  ⚫ Hama / Mahout
  ⚫ …

Hadoop Distributed File System (HDFS)
A distributed file system designed to run on commodity hardware

Overview
⚫ Responsible for storing data on the cluster
⚫ Data files are split into blocks and distributed across the nodes in the cluster
⚫ Each block is replicated multiple times

HDFS Basic Concepts
⚫ HDFS is a file system written in Java, based on Google's GFS
⚫ Provides redundant storage for massive amounts of data

HDFS Basic Concepts
⚫ HDFS works best with a smaller number of large files
  ⚫ millions as opposed to billions of files
  ⚫ typically 100 MB or more per file
⚫ Files in HDFS are write-once
⚫ Optimized for streaming reads of large files, not random reads

How are Files Stored
⚫ Files are split into blocks
  ⚫ 128 MB by default; can be adapted based on requirements
⚫ Blocks are split across many machines at load time
  ⚫ different blocks from the same file will be stored on different machines
⚫ Blocks are replicated across multiple machines
⚫ The NameNode keeps track of which blocks make up a file and where they are stored

Data Replication
⚫ Default replication is 3-fold

Data Retrieval
⚫ When a client wants to retrieve data
  ⚫ it communicates with the NameNode to determine which blocks make up a file and on which DataNodes those blocks are stored
  ⚫ it then communicates directly with the DataNodes to read the data

HDFS Architecture

HDFS User Interface (client)
⚫ Java API (see the sketch below)
⚫ Command line
  ⚫ bin/hdfs dfs -mkdir /foodir
  ⚫ bin/hdfs dfs -cat /foodir/myfile.txt
  ⚫ bin/hdfs dfs -cp /foodir/myfile.txt /foodir/myfile-backup.txt
  ⚫ bin/hdfs dfs -mv /foodir/myfile-backup.txt /foodir/myfile.txt
  ⚫ bin/hdfs dfs -rm /foodir/myfile.txt
  ⚫ bin/hdfs dfsadmin -report
  ⚫ bin/hdfs dfsadmin -decommission datanodename
⚫ Web interface
  ⚫ http://host:port/dfshealth.jsp
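A minimal sketch of using the Java API mentioned above, assuming the standard org.apache.hadoop.fs classes are on the classpath and the client is configured (via core-site.xml / hdfs-site.xml) to reach the cluster's NameNode; the path reuses /foodir/myfile.txt from the command-line examples, and the class name HdfsExample is only illustrative:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster configuration from the classpath,
        // so FileSystem.get() returns a client for the default (HDFS) file system
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/foodir/myfile.txt");  // same path as in the CLI examples above

        // Write a small file (remember: HDFS files are write-once)
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Stream the file back line by line
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}

The same FileSystem abstraction also backs the local file system, KFS and Amazon S3 implementations listed below.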
More about HDFS
⚫ http://hadoop.apache.org/core/docs/current/hdfs_design.html
⚫ Hadoop FileSystem API implementations
  ⚫ HDFS
  ⚫ local file system
  ⚫ Kosmos File System (KFS)
  ⚫ Amazon S3 file system

MapReduce
Distributing computation across nodes

MapReduce Overview
⚫ A method for distributing computation across multiple nodes
⚫ Fits a lot of batch-processing applications
  ⚫ log processing
  ⚫ web index building
⚫ Each node processes the data that is stored at that node
⚫ Consists of two main phases
  ⚫ Map
  ⚫ Reduce

MapReduce Features
⚫ Automatic parallelization and distribution
⚫ Fault tolerance
⚫ Provides a clean abstraction for programmers to use

The Mapper
⚫ Reads data as key/value pairs
  ⚫ the input key is often discarded
⚫ Outputs zero or more key/value pairs

Shuffle and Sort
⚫ Output from the mapper is sorted by key
⚫ All values with the same key are guaranteed to go to the same machine

The Reducer
⚫ Called once for each unique key
⚫ Gets a list of all values associated with that key as input
⚫ Outputs zero or more final key/value pairs
  ⚫ usually just one output per input key

MapReduce: Word Count
(a minimal Java sketch of this job is given at the end of this section)

Map/Reduce Usage Examples
⚫ When employed, two principal questions have to be answered:
  1. Which key/value pairs should be emitted by the Mappers?
  2. How should values with the same key be combined during the Reduce stage?

Map/Reduce Usage Examples
Example tasks:
1. From a CSV file, extract all records whose X value is lower than 5.
2. Combine (set union) the entries of two CSV files into a single one.
3. Perform a set intersection of the entries of two CSV files.
4. Given two CSV files X and Y, perform the set difference X - Y.

Hadoop Streaming
⚫ Allows Map and Reduce functions to be written in any language
  ⚫ native Hadoop MapReduce accepts only Java
  ⚫ makes development easier
⚫ Example: Word Count
  ⚫ hadoop jar hadoop-streaming.jar -input /user/zshao/articles -mapper 'tr " " "\n"' -reducer 'uniq -c' -output /user/zshao/ -numReduceTasks 32

Hadoop Usage
Other available tools

Hadoop Usage in e-INFRA CZ
⚫ Czech infrastructures (presented earlier) provide access to a prepared Hadoop cluster:
  ⚫ https://wiki.metacentrum.cz/wiki/Kategorie:Hadoop
⚫ Current resources:
  ⚫ 27 nodes ⨉ 16 cores
  ⚫ 27 nodes ⨉ 128 GB RAM
  ⚫ 1 PB HDFS storage
⚫ Documentation: https://wiki.metacentrum.cz/wiki/Hadoop
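For the "MapReduce: Word Count" example referenced above, the following is a minimal Java sketch, essentially the canonical WordCount program from the Hadoop documentation, written against the org.apache.hadoop.mapreduce API; the class names and input/output paths are illustrative, and the paths are passed as command-line arguments:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: the input key (byte offset in the file) is discarded;
  // for every word in the line it emits the pair (word, 1)
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: called once per unique word with the list of its counts;
  // emits one final pair (word, total) per key
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // optional local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

It would typically be packaged into a jar and submitted with, for example, hadoop jar wordcount.jar WordCount <input dir in HDFS> <output dir in HDFS>, paralleling the streaming invocation shown earlier.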