Towards Fast Multimedia Feature Extraction: Hadoop or Storm
David Mera, Michal Batko and Pavel Zezula
Laboratory of Data Intensive Systems and Applications (DISA), Masaryk University, Brno, Czech Republic
IEEE International Symposium on Multimedia 2014, Taichung, December 12th, 2014

Table of Contents
1 Introduction
2 Main goals
3 Processing frameworks
4 Testing scenarios
5 Infrastructure and datasets
6 Empirical evaluation
7 Conclusions and ongoing work

Introduction - Big Data
"90% of the data in the world today has been created in the last two years" (2013) [1]
- Huge new datasets are constantly being created.
- Organizations have potential access to a wealth of information, but they do not know how to extract value from it.
[1] Source: SINTEF, "Big Data - for better or worse"

Introduction - Multimedia Big Data
- 100 hours of video are uploaded to YouTube every minute.
- 350 million photos are uploaded to Facebook every day (2012).
- 60 million photos are uploaded to Instagram each day.
- 70% of data is non-structured; 60% of Internet traffic is multimedia [2]
[2] Source: IBM 2013 Global Technology Outlook report

Introduction - Multimedia Big Data
Getting information from large volumes of multimedia data:
- Content-based retrieval techniques address the findability problem.
- Extraction of suitable features is a time-consuming task.
Feature extraction approaches:
- Sequential approach: not affordable.
- Distributed computing (cluster computing, grid computing): requires high computer skills; 'ad-hoc' approaches have low reusability.
- Lack of failure handling.
Distributed computing - Big Data approaches:
- Batch data: Map-Reduce paradigm (Apache Hadoop).
- Real-time data processing: S4, Apache Storm.

Main goals
Main objective: to compare several distributed processing frameworks for extracting suitable features from a multimedia dataset. Specifically, the comparison focuses on Apache Hadoop [3] and Apache Storm [4].
[3] Apache Hadoop: hadoop.apache.org
[4] Apache Storm: storm.apache.org

Processing frameworks - Apache Hadoop
Map-Reduce paradigm:
[Figure: input data in the Hadoop File System is divided into splits (Split 0..n); Map tasks turn input (key, value) pairs into intermediate (key, value) pairs, which Reduce tasks aggregate into output pairs written back to the Hadoop File System.]

Processing frameworks - Apache Storm
Storm runs topologies:
- Streams: unbounded sequences of tuples.
- Spouts: sources of streams.
- Bolts: consume input streams, apply some processing, and emit new streams.
[Figure: a topology in which spouts feed streams of data into chains of bolts (A..E), each bolt replicated across several instances that emit transformed streams.]

Testing scenarios - Main scenario
Case-study basis: the feature extraction of images stored in external datasets. The resulting features must be placed in a distributed organizational storage.
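The Map-Reduce flow described above can be illustrated with a toy in-memory simulation. This is a sketch only, not Hadoop itself: `extract_feature` is a hypothetical stand-in that returns the payload length in place of a real MPEG-7 descriptor.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy in-memory Map-Reduce: map each record, group the intermediate
    (key, value) pairs by key (the 'shuffle'), then reduce each group."""
    intermediate = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            intermediate[key].append(value)
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

def extract_feature(image_bytes):
    # Hypothetical stand-in for a real extractor (e.g. an MPEG-7
    # descriptor): the "feature" is just the byte length of the payload.
    return len(image_bytes)

def map_fn(record):
    image_id, image_bytes = record
    yield image_id, extract_feature(image_bytes)

def reduce_fn(image_id, features):
    # One feature per image, so the reducer simply forwards it.
    return features[0]

images = [("img-1", b"\xff\xd8..."), ("img-2", b"\xff\xd8....")]
features = run_mapreduce(images, map_fn, reduce_fn)
```

In the real jobs the map input would come from HDFS splits and the output pairs would be written back to HDFS; here plain Python dicts and lists play both roles.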
Testing scenarios - Sub-scenarios
- Sub-scenario I: the external dataset must only be processed once.
- Sub-scenario II: the external dataset could be processed several times.
- Sub-scenario III: the external dataset could be processed several times; however, raw data cannot be stored internally due to legal restrictions.

Infrastructure and datasets
Hardware infrastructure - DISA cluster (4 nodes):
- 2x Intel E5405 @ 2 GHz CPUs (8 physical cores)
- 16 GB of RAM
- 500 GB SAS disk
- Gigabit Ethernet
Dataset:
- One million JPEG images
- Average size: 61.9 KB; total size: 61 GB
Testing subsets: 10,000 images; 100,000 images; 1,000,000 images

Empirical evaluation - Testing jobs
Apache Hadoop (MapReduce jobs):
- A job that retrieves external multimedia datasets and stores them in the HDFS as SequenceFiles.
- A job that extracts image features.
Apache Storm (topology):
- MultimediaSpout streams raw data from the external data source; ExtractorBolt computes the image features; SaveBolt stores the results (and, optionally, the raw data) in the Hadoop File System.
Extraction of MPEG-7 image descriptors: MESSIF library extractor [5]. Feature extraction takes roughly 0.5 s per image.
[5] M. Batko, D. Novak, and P. Zezula, "MESSIF: Metric similarity search implementation framework", in Digital Libraries: Research and Development, Springer, 2007.

Empirical evaluation - Evaluated metrics
- Speedup S measures how the rate of doing work increases with the number of processors k, compared to one processor: S(k) = SeqJob(data) / ParallelJob(data, k).
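The Storm topology above (a spout feeding an extractor bolt feeding a save bolt) can be mimicked as a small pipeline of Python generators. A sketch under the same stand-in assumption as before: the "feature" is merely the payload length, and a dict plays the role of HDFS.

```python
def multimedia_spout(dataset):
    """Spout: emit a stream of (image_id, raw_bytes) tuples read from
    the external dataset."""
    for image_id, raw in dataset:
        yield image_id, raw

def extractor_bolt(stream):
    """Bolt: consume the raw-image stream and emit (image_id, feature)
    tuples; len() stands in for a real MPEG-7 descriptor extractor."""
    for image_id, raw in stream:
        yield image_id, len(raw)

def save_bolt(stream, storage):
    """Bolt: persist each feature tuple into the dict-backed storage,
    which plays the role of the Hadoop File System."""
    for image_id, feature in stream:
        storage[image_id] = feature

storage = {}
dataset = [("img-1", b"abc"), ("img-2", b"abcdef")]
save_bolt(extractor_bolt(multimedia_spout(dataset)), storage)
```

Because the stages are generators, tuples flow through one at a time, which loosely mirrors Storm's record-at-a-time processing as opposed to Hadoop's batch model.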
Ideally, S(k) = k.
- Efficiency E measures the work rate per processor: E(k) = S(k) / k. Ideally, E(k) = 1.
- Processing time.

Empirical evaluation - Scalability experiments, 10,000 images
[Figure: speedup and efficiency vs. number of processors (4-32) for Basic Storm, Storm + RAW storage, Hadoop + data load, and Basic Hadoop, against the ideal value.]

Empirical evaluation - Scalability experiments, 100,000 images
[Figure: speedup and efficiency vs. number of processors (4-32) for the same four configurations.]

Empirical evaluation - Scalability experiments, 1,000,000 images
[Figure: speedup and efficiency with 32 processors vs. dataset size (10,000 to 1,000,000 images) for the same four configurations.]

Empirical evaluation - Processing time, 1,000,000 images
[Figure: processing time in seconds vs. number of images for Hadoop + data load, Basic Hadoop, Storm + RAW storage, and Basic Storm.]

Conclusions and ongoing work - Conclusions
- Sub-scenario 1 (external data must only be processed once): Hadoop is less adequate due to the data-retrieval penalty.
- Sub-scenario 2 (external data could be processed several times): Apache Hadoop takes advantage of internally stored data. Hybrid solution: Apache Storm for the first iteration, Apache Hadoop for the following iterations. Exception: small-to-medium datasets that do not need to be stored.
- Sub-scenario 3 (external data could be processed several times; however, it cannot be stored):
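The two metrics can be computed directly from measured job times. A minimal sketch; the timings below are illustrative placeholders, not measurements from the experiments.

```python
def speedup(seq_time, parallel_time):
    """S(k) = SeqJob(data) / ParallelJob(data, k); ideally S(k) = k."""
    return seq_time / parallel_time

def efficiency(seq_time, parallel_time, k):
    """E(k) = S(k) / k; ideally E(k) = 1."""
    return speedup(seq_time, parallel_time) / k

# Illustrative timings in seconds (not the paper's measurements):
seq_time = 5000.0       # job time on one processor
parallel_time = 250.0   # job time on k = 32 processors
k = 32
s = speedup(seq_time, parallel_time)
e = efficiency(seq_time, parallel_time, k)
```

With these placeholder numbers the job runs 20x faster on 32 processors, i.e. each processor is used at 62.5% efficiency; the plots in the evaluation report exactly these two quantities for each framework configuration.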
Apache Storm has shown good performance for processing external datasets as long as they do not need to be stored.

Conclusions and ongoing work - Conclusions
- Scalability: Storm scales better on small infrastructures, while Hadoop takes advantage of big ones.
- Input data management: Hadoop requires data arrangement when handling small-to-medium images.
- Configuration: Hadoop requires iterative tuning of its configuration.
- Job implementation: Storm is a low-level framework.
- Job results: Hadoop must fully process the data before showing results.

Conclusions and ongoing work - Ongoing work
- New experiments.
- A general adaptive system for processing multimedia datasets.

Towards Fast Multimedia Feature Extraction: Hadoop or Storm
Thank you for your attention!
David Mera (dmera@mail.muni.cz)