Big Data
A general approach to process multimedia datasets
David Mera
Masaryk University
Brno, Czech Republic
24/02/2014
Table of Contents
1 Introduction
Big Data
Terminology
2 Big data processing systems
Batch data approach
Stream data approach
3 System development
Highest system overview
Main goals
System overview
Lambda architecture
System architecture
Main challenges
System prototype
Ongoing work
Table of Contents
1 Introduction
Big Data
Terminology
2 Big data processing systems
Batch data approach
Stream data approach
3 System development
Highest system overview
Main goals
System overview
Lambda architecture
System architecture
Main challenges
System prototype
Ongoing work
Introduction
Big Data
Organizations have potential access to huge datasets of
heterogeneous data.
Stored data are usually not structured.
Data should be processed to uncover useful information.
Introduction
Terminology
Batch data
Static snapshot of a dataset
Batch computation has a ‘start’ and an ‘end’
Fast datasets processing
Stream data
Stream of events that ﬂows into the system at a given data
rate over which we have no control
Stream computation ‘never’ ends
The processing system must keep up with the event rate or
degrade gracefully
Near-real time answers
Table of Contents
1 Introduction
Big Data
Terminology
2 Big data processing systems
Batch data approach
Stream data approach
3 System development
Highest system overview
Main goals
System overview
Lambda architecture
System architecture
Main challenges
System prototype
Ongoing work
MapReduce
MapReduce is a framework for paralleling the processing of
massive datasets.
The Hadoop implementation is highly optimized for batch
processing
Hadoop attempts to run Map and Reduce tasks at the
machines were data being processed are located
Task Tracker
DataNode
Node 1
Task Tracker
DataNode
Node 2
Task Tracker
DataNode
Node n
Secondary
Name Node
HDFS
NameNode
MapReduce
Framework
Job Tracker
Master Node
MapReduce
Map and Reduce functions
MapReduce Job
Map function (mandatory)
Computation
intermediate<key', value>input<key,value>
Data Source
Reduce function (optional)
Merge function
Output (0..N)intermediate<key',value>
Storm
Distributed and fault-tolerant realtime computation
Storm cluster
Master node
The Nimbus daemon is responsible for distributing code around
the cluster, assigning tasks to machines, and monitoring for
failures
Worker nodes
The Supervisor daemon listens for work assigned to its machine
and starts and stops worker processes as necessary based on
what Nimbus has assigned to it.
Communication - Zookeeper
Nimbus
Zookeeper
Zookeeper
Zookeeper
Supervisor
Supervisor
Supervisor
Supervisor
Supervisor
Storm
Components
Storm runs topologies
Graph of computation
Each node in a topology contains processing logic
Stream
Unbounded sequence of tuples
Spout
It reads input data from an external source and emits them as
a stream
It is capable of replaying a tuple
Bolt
Input streams –> some processing –> new streams.
Spout
Bolt
Bolt
Bolt
Bolt
Spout
Storm
Parallelism of a Storm topology
Topologies execute across worker processes (JVM)
Tasks are spread evenly across all the workers
The parallelism for each node is deﬁned by the user
User can also specify tasks for each node
Stream grouping - How a stream should be partitioned
i.e.Shufﬂe grouping
Scalability in processing time
TOPOLOGY
Worker Process
Task
Task
Task
Task
Task
Task
Worker Process
Task
Task
Task
Task
Task
Task
Pink
Spout
Blue
Bolt
Green
Bolt
Previous conclusions
“Attempting to build a general-purpose platform for both batch and
stream computing would result in a highly complex system that
may end up not being optimal for either task”
Table of Contents
1 Introduction
Big Data
Terminology
2 Big data processing systems
Batch data approach
Stream data approach
3 System development
Highest system overview
Main goals
System overview
Lambda architecture
System architecture
Main challenges
System prototype
Ongoing work
System development
Highest system overview
System development
Goals
Efﬁcient processing of huge datasets
External and internal data access
Heterogeneous data management
Processing of arbitrary functions
Data relations management
Infrastructure ﬂexibility
System development
System overview
...
... ... ...
...
...
Job(Algorithm, libraries, Data source)
Data Source
Distributed storage system
Monitoring
system
Stream of events
Event(key,TimeStamp,Datum)
Processing system
Subtask-1 Subtask-2 Subtask-N
Subtask-M
Query
system
query
System development
Lambda architecture
Batch layer
Storage of the master dataset
Batch views computation
New Data
Speed layer
All Data Batch View
Batch View
Serving LayerBatch Layer
Realtime View
Realtime View
Query
Query
Merge
Merge
System development
Lambda architecture
Serving layer
Batch views storage
Efﬁcient query system
The views are updated whenever the batch layer ﬁnishes
precomputing a batch view
New Data
Speed layer
All Data Batch View
Batch View
Serving LayerBatch Layer
Realtime View
Realtime View
Query
Query
Merge
Merge
System development
Lambda architecture
Speed layer
Realtime processing of arbitrary functions on arbitrary data
Real time views computation via incremental updates
New Data
Speed layer
All Data Batch View
Batch View
Serving LayerBatch Layer
Realtime View
Realtime View
Query
Query
Merge
Merge
System development
General overview of the architecture
System interface
Processing Layer
Computer resources
Speed layer
Storm
Batch layer
MapReduce
Hadoop
View layer
Middleware layer
JOB (URI,algorithm, libraries)
Monitoring system
Monitoring Layer
Cluster
monitoring
Processing
system
monitoring
Hadoop V2
Distributed File System
HDFS
YARN
Cluster Resource Management
Metascheduler
Data Source
Stream of events
Event (key,timestamp,datum)
System development
Processing layer - Main challenges
Data source access
URIs (Uniform Resource Identiﬁer)
External data: Speed Layer (virtual streams)
Internal data: Batch Layer
Data management
Heterogeneous data
Data storage
Data relations
Timestamp
Specialized Storm topologies
Processing arbitrary functions
Meta-language
Scheduler
System development
Monitoring layer - Main challenges
Infraestructure ﬂexibility
Dedicated hardware infrastructure: it is expensive and very
often it is wasted
Shared infraestructure: processing systems are not usually
adapted.
Main Challenges
Monitoring system to analyze the status of the cluster and jobs
Metascheduler to automatically modify the use of the
infraestructure according to the monitoring system
System development
Prototype
A virtual cluster via Virtual Box was deployed
Hadoop and HDFS were installed (batch layer)
Storm was installed (speed layer)
Master node
Slave nodes
Nimbus (Storm)
Supervisors (Storm)
NameNode (HDFS)
Datenodes (HDFS)
System development
Prototype
A Spout to get external data was deployed
Data containers were developed
A generic bolt was designed to store data
Speciﬁc implementation to deal with HDFS
Bolt takes into account the block size
...
...
......
...
Storm
Topology
HDFS
JOB (topology, libraries, server, port)
Save raw data
Extract
features
Save features
Soket
Spout
Ongoing work
Deployment of the prototype in a real cluster
Comparative study between the prototype and other
processing approaches
Sequential computing
Grid computing
MapReduce
Infrastructure ﬂexibility
Storm ﬂexibility
Monitoring system development
Cluster status
Job status
Metascheduler development
Big Data
A general approach to process multimedia datasets
Thank you for your attention!