MapReduce:
Simplified Data Processing on Large Clusters
Jeff Dean, Sanjay Ghemawat
Google, Inc.
December, 2004
Jeff Dean, Sanjay Ghemawat MapReduce
Motivation: Large Scale Data Processing
Many tasks: Process lots of data to produce other data
Want to use hundreds or thousands of CPUs
... but this needs to be easy
MapReduce provides:
Automatic parallelization and distribution
Fault-tolerance
I/O scheduling
Status and monitoring
Programming model
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
map (in_key, in_value) -> list(out_key, intermediate_value)
Processes input key/value pair
Produces set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
Combines all intermediate values for a particular key
Produces a set of merged output values (usually just one)
Inspired by similar primitives in LISP and other languages
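The two signatures above can be exercised with a small sequential driver. This is only a sketch of the programming model, not the distributed library: `run_mapreduce`, `length_map` and `sum_reduce` are hypothetical names, and everything runs in one process.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Sequentially apply map_fn, group values by key, then apply reduce_fn."""
    # Map phase: each input pair yields zero or more intermediate pairs.
    intermediate = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in map_fn(in_key, in_value):
            intermediate[out_key].append(value)
    # Reduce phase: one call per distinct intermediate key, in sorted key order.
    output = {}
    for out_key in sorted(intermediate):
        output[out_key] = list(reduce_fn(out_key, iter(intermediate[out_key])))
    return output


def length_map(doc_name, contents):
    # Hypothetical example: emit (word length, 1) for every word.
    for word in contents.split():
        yield len(word), 1


def sum_reduce(key, values):
    # Usually produces just one merged output value per key.
    yield sum(values)


histogram = run_mapreduce([("d", "to be or not")], length_map, sum_reduce)
```

Here `histogram` maps each word length to a one-element list holding its count.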
Example: Count word occurrences
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
Pseudocode: See appendix in paper for real code
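A runnable single-process rendering of the pseudocode, as a rough sketch. It keeps the string-valued counts of the slide; `word_count` is a hypothetical helper that stands in for the library's grouping step, and splitting on whitespace is an assumption about what counts as a word.

```python
from collections import defaultdict


def map_fn(input_key, input_value):
    # input_key: document name; input_value: document contents
    for word in input_value.split():
        yield word, "1"


def reduce_fn(output_key, intermediate_values):
    # output_key: a word; intermediate_values: an iterator of counts
    result = 0
    for v in intermediate_values:
        result += int(v)
    yield str(result)


def word_count(documents):
    """Group intermediate pairs by word, then reduce each group."""
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            grouped[word].append(count)
    return {w: next(reduce_fn(w, iter(c))) for w, c in grouped.items()}


counts = word_count({"a.txt": "the cat the dog", "b.txt": "the end"})
```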
Model is Widely Applicable
MapReduce Programs In Google Source Tree
Example uses:
distributed grep
distributed sort
web link-graph reversal
term-vector per host
web access log stats
inverted index construction
document clustering
machine learning
statistical machine translation
...
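One of the uses listed above, web link-graph reversal, fits the model in a few lines. The sketch below assumes links have already been extracted into a dict of source URL to target URLs; `map_links`, `reduce_links` and `reverse_link_graph` are hypothetical names.

```python
from collections import defaultdict


def map_links(source_url, target_urls):
    # Emit (target, source) for every outgoing link found on the source page.
    for target in target_urls:
        yield target, source_url


def reduce_links(target_url, source_urls):
    # Merge all sources pointing at one target into a sorted, de-duplicated list.
    yield sorted(set(source_urls))


def reverse_link_graph(pages):
    grouped = defaultdict(list)
    for source, targets in pages.items():
        for target, src in map_links(source, targets):
            grouped[target].append(src)
    return {t: next(reduce_links(t, srcs)) for t, srcs in grouped.items()}


incoming = reverse_link_graph({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```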
Implementation Overview
Typical cluster:
100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
Limited bisection bandwidth
Storage is on local IDE disks
GFS: distributed file system manages data (SOSP’03)
Job scheduling system: jobs made up of tasks, scheduler
assigns tasks to machines
Implementation is a C++ library linked into user programs
Execution
Parallel Execution
Task Granularity And Pipelining
Fine granularity tasks: many more map tasks than machines
Minimizes time for fault recovery
Can pipeline shuffling with map execution
Better dynamic load balancing
Often use 200,000 map tasks and 5,000 reduce tasks with 2,000 machines
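Why fine granularity helps load balancing can be seen in a toy model: idle workers pull the next task, and the phase ends when the last worker finishes. The durations below are made up for illustration, and `makespan` is a hypothetical helper, not part of the library.

```python
import heapq


def makespan(task_durations, num_workers):
    """Time until all tasks finish when each idle worker pulls the next task."""
    # Each heap entry is the time at which one worker becomes free.
    free_at = [0.0] * num_workers
    heapq.heapify(free_at)
    for duration in task_durations:
        start = heapq.heappop(free_at)  # earliest-free worker takes the task
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# Same 12 units of work on 2 workers: two uneven tasks vs. twelve small ones.
coarse = makespan([10.0, 2.0], num_workers=2)
fine = makespan([1.0] * 12, num_workers=2)
```

With one task per machine the slow task dictates the finish time; with many small tasks the work evens out across workers.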
Fault tolerance: Handled via re-execution
On worker failure:
Detect failure via periodic heartbeats
Re-execute completed and in-progress map tasks
Re-execute in-progress reduce tasks
Task completion committed through master
Master failure:
Could handle, but don’t yet (master failure unlikely)
Robust: lost 1600 of 1800 machines once, but finished fine
Semantics in presence of failures: see paper
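The worker-failure rules above can be sketched as a master-side state update. The names (`Task`, `State`, `handle_worker_failure`) are hypothetical. Completed map tasks are reset because their output sits on the failed machine's local disk, while completed reduce output is already in the global file system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"


@dataclass
class Task:
    kind: str  # "map" or "reduce"
    state: State = State.IDLE
    worker: Optional[str] = None


def handle_worker_failure(tasks, failed_worker):
    """Reset every task whose work was lost with failed_worker; return them."""
    rescheduled = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        lost_map = task.kind == "map" and task.state in (
            State.IN_PROGRESS, State.COMPLETED)
        lost_reduce = task.kind == "reduce" and task.state is State.IN_PROGRESS
        if lost_map or lost_reduce:
            task.state = State.IDLE  # eligible for scheduling on another worker
            task.worker = None
            rescheduled.append(task)
    return rescheduled


tasks = [
    Task("map", State.COMPLETED, "w1"),
    Task("map", State.IN_PROGRESS, "w1"),
    Task("reduce", State.COMPLETED, "w1"),
    Task("reduce", State.IN_PROGRESS, "w1"),
    Task("map", State.COMPLETED, "w2"),
]
lost = handle_worker_failure(tasks, "w1")
```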
Refinement: Redundant Execution
Slow workers significantly lengthen completion time
Other jobs consuming resources on machine
Bad disks with soft errors transfer data very slowly
Weird things: processor caches disabled (!!)
Solution: Near end of phase, spawn backup copies of tasks
Whichever one finishes first "wins"
Effect: Dramatically shortens job completion time
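A toy model of the backup-task idea: a task ends as soon as either its primary or its backup copy finishes. The numbers are invented, all primaries are assumed to start at time 0 on their own machine, and `phase_completion_time` is a hypothetical helper.

```python
def phase_completion_time(primary_times, backup_start=None, backup_duration=None):
    """Finish time of a phase; each task ends when its first copy finishes."""
    finish_times = []
    for primary in primary_times:
        finish = primary
        # A backup is only spawned for tasks still running at backup_start.
        if backup_start is not None and primary > backup_start:
            finish = min(primary, backup_start + backup_duration)
        finish_times.append(finish)
    return max(finish_times)


times = [10, 11, 12, 60]  # one straggler
without_backup = phase_completion_time(times)
with_backup = phase_completion_time(times, backup_start=12, backup_duration=10)
```

In this model the straggler no longer determines the phase length once a backup copy runs on a healthy machine.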
Refinement: Locality Optimization
Master scheduling policy:
Asks GFS for locations of replicas of input file blocks
Map task inputs typically split into 64 MB pieces (== GFS block size)
Map tasks scheduled so that a GFS input block replica is on the same
machine or same rack
Effect: Thousands of machines read input at local disk speed
Without this, rack switches limit read rate
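A minimal sketch of the placement preference described above: same machine first, then same rack, then anything idle. The replica locations are passed in directly rather than fetched from GFS, and `pick_worker` and `rack_of` are hypothetical names.

```python
def pick_worker(replica_hosts, idle_workers, rack_of):
    """Choose an idle worker for a map task, preferring data locality."""
    # 1. A worker that already holds a replica of the input block.
    for worker in idle_workers:
        if worker in replica_hosts:
            return worker, "local"
    # 2. A worker on the same rack as some replica.
    replica_racks = {rack_of[host] for host in replica_hosts}
    for worker in idle_workers:
        if rack_of[worker] in replica_racks:
            return worker, "same-rack"
    # 3. Fall back to any idle worker; the read crosses rack switches.
    if idle_workers:
        return idle_workers[0], "remote"
    return None, "none"


rack_of = {"m1": "r1", "m2": "r1", "m3": "r2", "m4": "r3"}
choice_local = pick_worker({"m1", "m3"}, ["m2", "m3"], rack_of)
choice_rack = pick_worker({"m1"}, ["m4", "m2"], rack_of)
choice_remote = pick_worker({"m1"}, ["m4"], rack_of)
```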
Refinement: Skipping Bad Records
Map/Reduce functions sometimes fail for particular inputs
Best solution is to debug & fix, but not always possible
On seg fault:
Send UDP packet to master from signal handler
Include sequence number of record being processed
If master sees two failures for same record:
Next worker is told to skip the record
Effect: Can work around bugs in third-party libraries
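The skip protocol can be sketched in one process. This is a simplification: a Python exception stands in for the seg fault, and a direct method call stands in for the UDP packet sent from the signal handler. `Master`, `run_task` and `run_with_retries` are hypothetical names.

```python
from collections import Counter


class Master:
    def __init__(self, max_failures=2):
        self.failures = Counter()  # record sequence number -> failures seen
        self.max_failures = max_failures

    def report_failure(self, seq):
        self.failures[seq] += 1

    def should_skip(self, seq):
        return self.failures[seq] >= self.max_failures


def run_task(records, process, master):
    """Run one attempt; return outputs, or None if the attempt crashed."""
    outputs = []
    for seq, record in enumerate(records):
        if master.should_skip(seq):
            continue  # master has seen two failures on this record
        try:
            outputs.append(process(record))
        except Exception:
            master.report_failure(seq)  # stand-in for the UDP notification
            return None
    return outputs


def run_with_retries(records, process, master, max_attempts=5):
    for _ in range(max_attempts):
        result = run_task(records, process, master)
        if result is not None:
            return result
    raise RuntimeError("task did not complete")


master = Master()
result = run_with_retries([1, 0, 5], lambda r: 10 // r, master)
```

The record `0` crashes two attempts with a division by zero; the third attempt skips it and the task completes.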
Other Refinements (see paper)
Sorting guarantees within each reduce partition
Compression of intermediate data
Combiner: useful for saving network bandwidth
Local execution for debugging/testing
User-defined counters
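How a combiner saves bandwidth, as a rough sketch for word count: partial sums are computed on the map side, so fewer pairs cross the network. This works here because addition is commutative and associative. The function names and sample splits are hypothetical.

```python
from collections import Counter


def map_word_count(text):
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Partially sum counts on the map side: one pair per distinct word."""
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())


def reduce_counts(all_pairs):
    totals = Counter()
    for word, count in all_pairs:
        totals[word] += count
    return dict(totals)


splits = ["the cat the dog the end", "the the the"]
raw = [map_word_count(s) for s in splits]
combined = [combine(pairs) for pairs in raw]

pairs_without = sum(len(p) for p in raw)      # pairs shuffled with no combiner
pairs_with = sum(len(p) for p in combined)    # pairs shuffled with a combiner
totals_without = reduce_counts(p for pairs in raw for p in pairs)
totals_with = reduce_counts(p for pairs in combined for p in pairs)
```

Both paths give the same totals; only the number of intermediate pairs differs.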
Performance
Tests run on cluster of 1800 machines:
4 GB of memory
Dual-processor 2 GHz Xeons with Hyperthreading
Dual 160 GB IDE disks
Gigabit Ethernet per machine
Bisection bandwidth approximately 100 Gbps
Two benchmarks:
MR Grep: Scan 10^10 100-byte records to extract records matching
a rare pattern (92K matching records)
MR Sort: Sort 10^10 100-byte records (modeled after TeraSort benchmark)
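A quick size check for both benchmarks, assuming input splits of 64 MB (taken here as 64 * 2^20 bytes, matching the GFS block size quoted on the locality slide). The constant names are hypothetical.

```python
import math

RECORDS = 10**10
RECORD_BYTES = 100
SPLIT_BYTES = 64 * 2**20  # assumed split size: 64 MB

total_bytes = RECORDS * RECORD_BYTES               # 10^12 bytes, about 1 TB
num_splits = math.ceil(total_bytes / SPLIT_BYTES)  # roughly 15,000 map tasks
```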
MR Grep
Locality optimization helps:
1800 machines read 1 TB of data at peak of ~31 GB/s
Without this, rack switches would limit to 10 GB/s
Startup overhead is significant for short jobs
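Back-of-the-envelope figures implied by the numbers above, assuming decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and that the peak rate is spread evenly over all machines. Variable names are hypothetical.

```python
MACHINES = 1800
PEAK_BYTES_PER_S = 31e9             # ~31 GB/s with locality optimization
SWITCH_LIMITED_BYTES_PER_S = 10e9   # 10 GB/s without it
TOTAL_BYTES = 1e12                  # 1 TB of input

per_machine_mb_s = PEAK_BYTES_PER_S / MACHINES / 1e6     # about 17 MB/s each
min_scan_seconds = TOTAL_BYTES / PEAK_BYTES_PER_S        # about 32 s at peak
locality_gain = PEAK_BYTES_PER_S / SWITCH_LIMITED_BYTES_PER_S  # about 3.1x
```

The scan time is a lower bound only, since the job does not run at peak rate throughout and pays startup overhead.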
MR Sort
Backup tasks reduce job completion time significantly
System deals well with failures
Experience: Rewrite of Production Indexing System
Rewrote Google’s production indexing system using MapReduce
Set of 10, 14, 17, 21, 24 MapReduce operations
New code is simpler, easier to understand
MapReduce takes care of failures, slow machines
Easy to make indexing faster by adding more machines
Usage: MapReduce jobs run in August 2004
Number of jobs                      29,423
Average job completion time         634 secs
Machine days used                   79,186 days
Input data read                     3,288 TB
Intermediate data produced          758 TB
Output data written                 193 TB
Average worker machines per job     157
Average worker deaths per job       1.2
Average map tasks per job           3,351
Average reduce tasks per job        55
Unique map implementations          395
Unique reduce implementations       269
Unique map/reduce combinations      426
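A few ratios derived from the table, as a sketch; the variable names are hypothetical and 1 TB is taken as 1000 GB.

```python
jobs = 29_423
machine_days = 79_186
input_tb = 3_288
intermediate_tb = 758
output_tb = 193

avg_input_gb_per_job = input_tb * 1000 / jobs      # about 112 GB read per job
intermediate_ratio = intermediate_tb / input_tb    # about 0.23
output_ratio = output_tb / input_tb                # about 0.06
machine_days_per_job = machine_days / jobs         # about 2.7
```

On average a job in that month shuffled under a quarter of what it read and wrote well under a tenth of it.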
Related Work
Programming model inspired by functional language primitives
Partitioning/shuffling similar to many large-scale sorting systems
NOW-Sort [’97]
Re-execution for fault tolerance
BAD-FS [’04] and TACC [’97]
Locality optimization has parallels with Active Disks/Diamond work
Active Disks [’01], Diamond [’04]
Backup tasks similar to Eager Scheduling in Charlotte system
Charlotte [’96]
Dynamic load balancing solves similar problem as River’s distributed queues
River [’99]
Conclusions
MapReduce has proven to be a useful abstraction
Greatly simplifies large-scale computations at Google
Fun to use: focus on problem, let library deal with messy details
Thanks to Josh Levenberg, who has made many significant improvements,
and to everyone else at Google who has used and helped to improve
MapReduce.