Big Data A general approach to process multimedia datasets David Mera Masaryk University Brno, Czech Republic 24/02/2014 Table of Contents 1 Introduction Big Data Terminology 2 Big data processing systems Batch data approach Stream data approach 3 System development Highest system overview Main goals System overview Lambda architecture System architecture Main challenges System prototype Ongoing work Table of Contents 1 Introduction Big Data Terminology 2 Big data processing systems Batch data approach Stream data approach 3 System development Highest system overview Main goals System overview Lambda architecture System architecture Main challenges System prototype Ongoing work Introduction Big Data Organizations have potential access to huge datasets of heterogeneous data. Stored data are usually not structured. Data should be processed to uncover useful information. Introduction Terminology Batch data Static snapshot of a dataset Batch computation has a ‘start’ and an ‘end’ Fast datasets processing Stream data Stream of events that flows into the system at a given data rate over which we have no control Stream computation ‘never’ ends The processing system must keep up with the event rate or degrade gracefully Near-real time answers Table of Contents 1 Introduction Big Data Terminology 2 Big data processing systems Batch data approach Stream data approach 3 System development Highest system overview Main goals System overview Lambda architecture System architecture Main challenges System prototype Ongoing work MapReduce MapReduce is a framework for paralleling the processing of massive datasets. The Hadoop implementation is highly optimized for batch processing Hadoop attempts to run Map and Reduce tasks at the machines were data being processed are located Task Tracker DataNode Node 1 Task Tracker DataNode Node 2 Task Tracker DataNode Node n Secondary Name Node HDFS NameNode MapReduce Framework Job Tracker Master Node MapReduce Map and Reduce functions MapReduce Job Map function (mandatory) Computation intermediateinput Data Source Reduce function (optional) Merge function Output (0..N)intermediate Storm Distributed and fault-tolerant realtime computation Storm cluster Master node The Nimbus daemon is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures Worker nodes The Supervisor daemon listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. Communication - Zookeeper Nimbus Zookeeper Zookeeper Zookeeper Supervisor Supervisor Supervisor Supervisor Supervisor Storm Components Storm runs topologies Graph of computation Each node in a topology contains processing logic Stream Unbounded sequence of tuples Spout It reads input data from an external source and emits them as a stream It is capable of replaying a tuple Bolt Input streams –> some processing –> new streams. Spout Bolt Bolt Bolt Bolt Spout Storm Parallelism of a Storm topology Topologies execute across worker processes (JVM) Tasks are spread evenly across all the workers The parallelism for each node is defined by the user User can also specify tasks for each node Stream grouping - How a stream should be partitioned i.e.Shuffle grouping Scalability in processing time TOPOLOGY Worker Process Task Task Task Task Task Task Worker Process Task Task Task Task Task Task Pink Spout Blue Bolt Green Bolt Previous conclusions “Attempting to build a general-purpose platform for both batch and stream computing would result in a highly complex system that may end up not being optimal for either task” Table of Contents 1 Introduction Big Data Terminology 2 Big data processing systems Batch data approach Stream data approach 3 System development Highest system overview Main goals System overview Lambda architecture System architecture Main challenges System prototype Ongoing work System development Highest system overview System development Goals Efficient processing of huge datasets External and internal data access Heterogeneous data management Processing of arbitrary functions Data relations management Infrastructure flexibility System development System overview ... ... ... ... ... ... Job(Algorithm, libraries, Data source) Data Source Distributed storage system Monitoring system Stream of events Event(key,TimeStamp,Datum) Processing system Subtask-1 Subtask-2 Subtask-N Subtask-M Query system query System development Lambda architecture Batch layer Storage of the master dataset Batch views computation New Data Speed layer All Data Batch View Batch View Serving LayerBatch Layer Realtime View Realtime View Query Query Merge Merge System development Lambda architecture Serving layer Batch views storage Efficient query system The views are updated whenever the batch layer finishes precomputing a batch view New Data Speed layer All Data Batch View Batch View Serving LayerBatch Layer Realtime View Realtime View Query Query Merge Merge System development Lambda architecture Speed layer Realtime processing of arbitrary functions on arbitrary data Real time views computation via incremental updates New Data Speed layer All Data Batch View Batch View Serving LayerBatch Layer Realtime View Realtime View Query Query Merge Merge System development General overview of the architecture System interface Processing Layer Computer resources Speed layer Storm Batch layer MapReduce Hadoop View layer Middleware layer JOB (URI,algorithm, libraries) Monitoring system Monitoring Layer Cluster monitoring Processing system monitoring Hadoop V2 Distributed File System HDFS YARN Cluster Resource Management Metascheduler Data Source Stream of events Event (key,timestamp,datum) System development Processing layer - Main challenges Data source access URIs (Uniform Resource Identifier) External data: Speed Layer (virtual streams) Internal data: Batch Layer Data management Heterogeneous data Data storage Data relations Timestamp Specialized Storm topologies Processing arbitrary functions Meta-language Scheduler System development Monitoring layer - Main challenges Infraestructure flexibility Dedicated hardware infrastructure: it is expensive and very often it is wasted Shared infraestructure: processing systems are not usually adapted. Main Challenges Monitoring system to analyze the status of the cluster and jobs Metascheduler to automatically modify the use of the infraestructure according to the monitoring system System development Prototype A virtual cluster via Virtual Box was deployed Hadoop and HDFS were installed (batch layer) Storm was installed (speed layer) Master node Slave nodes Nimbus (Storm) Supervisors (Storm) NameNode (HDFS) Datenodes (HDFS) System development Prototype A Spout to get external data was deployed Data containers were developed A generic bolt was designed to store data Specific implementation to deal with HDFS Bolt takes into account the block size ... ... ...... ... Storm Topology HDFS JOB (topology, libraries, server, port) Save raw data Extract features Save features Soket Spout Ongoing work Deployment of the prototype in a real cluster Comparative study between the prototype and other processing approaches Sequential computing Grid computing MapReduce Infrastructure flexibility Storm flexibility Monitoring system development Cluster status Job status Metascheduler development Big Data A general approach to process multimedia datasets Thank you for your attention!