Distributed Multi-modal Similarity Retrieval David Novak }w¡¢£¤¥¦§¨!"#$%&123456789@ACDEFGHIPQRS`ye| Seminar of DISA Lab, October 14, 2014 David Novak Multi-modal Similarity Retrieval DISA Seminar 1 / 17 Outline of the Talk 1 Motivation Similarity Search Effectiveness and Efficiency Multi-modal Search 2 Existing Solutions Similarity Indexing Distributed Key-value Stores 3 Big Data Similarity Retrieval Generic Architecture Specific System 4 Conclusions David Novak Multi-modal Similarity Retrieval DISA Seminar 2 / 17 Motivation Similarity Search Motivation The similarity is key to human cognition, learning, memory. . . [cognitive psychology] David Novak Multi-modal Similarity Retrieval DISA Seminar 3 / 17 Motivation Similarity Search Motivation The similarity is key to human cognition, learning, memory. . . [cognitive psychology] everything we can see, hear, measure, observe is in digital form David Novak Multi-modal Similarity Retrieval DISA Seminar 3 / 17 Motivation Similarity Search Motivation The similarity is key to human cognition, learning, memory. . . [cognitive psychology] everything we can see, hear, measure, observe is in digital form Therefore, computers should be able to search data base on similarity David Novak Multi-modal Similarity Retrieval DISA Seminar 3 / 17 Motivation Similarity Search Motivation The similarity is key to human cognition, learning, memory. . . [cognitive psychology] everything we can see, hear, measure, observe is in digital form Therefore, computers should be able to search data base on similarity The similarity search problem has two aspects effectiveness: how to measure similarity of two “objects” domain specific (photos, X-rays, MRT results, voice, music, EEG,. . . ) David Novak Multi-modal Similarity Retrieval DISA Seminar 3 / 17 Motivation Similarity Search Motivation The similarity is key to human cognition, learning, memory. . . [cognitive psychology] everything we can see, hear, measure, observe is in digital form Therefore, computers should be able to search data base on similarity The similarity search problem has two aspects effectiveness: how to measure similarity of two “objects” domain specific (photos, X-rays, MRT results, voice, music, EEG,. . . ) efficiency: how to realize similarity search fast using a given similarity measure on very large data collections David Novak Multi-modal Similarity Retrieval DISA Seminar 3 / 17 Motivation Effectiveness and Efficiency Efficiency: Motivation Example Type of data: general images (photos) David Novak Multi-modal Similarity Retrieval DISA Seminar 4 / 17 Motivation Effectiveness and Efficiency Efficiency: Motivation Example Type of data: general images (photos) every image has been processed by a deep neural network to obtain a “semantic characterization” of the image (descriptor) David Novak Multi-modal Similarity Retrieval DISA Seminar 4 / 17 Motivation Effectiveness and Efficiency Efficiency: Motivation Example Type of data: general images (photos) every image has been processed by a deep neural network to obtain a “semantic characterization” of the image (descriptor) compared by Euclidean distance, it measures visual similarity of images David Novak Multi-modal Similarity Retrieval DISA Seminar 4 / 17 Motivation Effectiveness and Efficiency Efficiency: Motivation Example Type of data: general images (photos) every image has been processed by a deep neural network to obtain a “semantic characterization” of the image (descriptor) compared by Euclidean distance, it measures visual similarity of images David Novak Multi-modal Similarity Retrieval DISA Seminar 4 / 17 Motivation Effectiveness and Efficiency Efficiency: Motivation Example Type of data: general images (photos) every image has been processed by a deep neural network to obtain a “semantic characterization” of the image (descriptor) compared by Euclidean distance, it measures visual similarity of images David Novak Multi-modal Similarity Retrieval DISA Seminar 4 / 17 Motivation Effectiveness and Efficiency Efficiency: Motivation Example Type of data: general images (photos) every image has been processed by a deep neural network to obtain a “semantic characterization” of the image (descriptor) compared by Euclidean distance, it measures visual similarity of images Efficiency problem: what if we had 100 million of images with such descriptors each descriptor is a 4096-dimensional float vector David Novak Multi-modal Similarity Retrieval DISA Seminar 4 / 17 Motivation Effectiveness and Efficiency Efficiency: Motivation Example Type of data: general images (photos) every image has been processed by a deep neural network to obtain a “semantic characterization” of the image (descriptor) compared by Euclidean distance, it measures visual similarity of images Efficiency problem: what if we had 100 million of images with such descriptors each descriptor is a 4096-dimensional float vector ⇒ over 1.5 TB of data to be organized for similarity search answer similarity queries online David Novak Multi-modal Similarity Retrieval DISA Seminar 4 / 17 Motivation Multi-modal Search Real Application: Multi-field Data real-world application data objects would have many “fields”: attribute fields (numbers, strings, dates, etc.) (several) descriptors for similarity search keywords/annotations for full-text search, etc. David Novak Multi-modal Similarity Retrieval DISA Seminar 5 / 17 Motivation Multi-modal Search Real Application: Multi-field Data real-world application data objects would have many “fields”: attribute fields (numbers, strings, dates, etc.) (several) descriptors for similarity search keywords/annotations for full-text search, etc. example: { "ID": "image_1", "author": "David Novak", "date": "20140327", "categories": [ "outdoor", "family" ], "DNN_visual_descriptor": [5.431, 0.0042, 0.0, 0.97,... ], "dominant_color": "0x9E, 0xC2, 0x13", "keywords": "summer, beach, ocean, sun, sand" } David Novak Multi-modal Similarity Retrieval DISA Seminar 5 / 17 Motivation Multi-modal Search Objectives Goal: generic, horizontally scalable system architecture that would allow standard attribute-based access keyword (full-text) search similarity search in “arbitrary” similarity space David Novak Multi-modal Similarity Retrieval DISA Seminar 6 / 17 Motivation Multi-modal Search Objectives Goal: generic, horizontally scalable system architecture that would allow standard attribute-based access keyword (full-text) search similarity search in “arbitrary” similarity space multi-modal search – combination of several search perspectives, e.g. direct combination of similarity modalities similarity query with filtering by attribute(s) re-ranking of search result by different criteria David Novak Multi-modal Similarity Retrieval DISA Seminar 6 / 17 Motivation Multi-modal Search Objectives Goal: generic, horizontally scalable system architecture that would allow standard attribute-based access keyword (full-text) search similarity search in “arbitrary” similarity space multi-modal search – combination of several search perspectives, e.g. direct combination of similarity modalities similarity query with filtering by attribute(s) re-ranking of search result by different criteria . . . and do it all on a very large scale voluminous data collections high query throughput David Novak Multi-modal Similarity Retrieval DISA Seminar 6 / 17 Existing Solutions Similarity Indexing Distance-based Similarity Search generic similarity search applicable to many domains data modeled as metric space (D, δ), where D is a domain of objects and δ is a total distance function δ : D × D −→ R+ 0 satisfying postulates of identity, symmetry, and triangle inequality David Novak Multi-modal Similarity Retrieval DISA Seminar 7 / 17 Existing Solutions Similarity Indexing Distance-based Similarity Search generic similarity search applicable to many domains data modeled as metric space (D, δ), where D is a domain of objects and δ is a total distance function δ : D × D −→ R+ 0 satisfying postulates of identity, symmetry, and triangle inequality query by example: K-NN(q) returns K objects x from the dataset X ⊆ D with the smallest δ(q, x) 23 1 q David Novak Multi-modal Similarity Retrieval DISA Seminar 7 / 17 Existing Solutions Similarity Indexing Similarity Indexing Techniques Metric-based similarity indexing: two decades of research memory structures for precise K-NN search David Novak Multi-modal Similarity Retrieval DISA Seminar 8 / 17 Existing Solutions Similarity Indexing Similarity Indexing Techniques Metric-based similarity indexing: two decades of research memory structures for precise K-NN search efficient disk-oriented techniques precise and approximate (not all objects from K-NN answer returned) David Novak Multi-modal Similarity Retrieval DISA Seminar 8 / 17 Existing Solutions Similarity Indexing Similarity Indexing Techniques Metric-based similarity indexing: two decades of research memory structures for precise K-NN search efficient disk-oriented techniques precise and approximate (not all objects from K-NN answer returned) objects are partitioned and organized on disk by the similarity metric disk storage David Novak Multi-modal Similarity Retrieval DISA Seminar 8 / 17 Existing Solutions Similarity Indexing Similarity Indexing Techniques Metric-based similarity indexing: two decades of research memory structures for precise K-NN search efficient disk-oriented techniques precise and approximate (not all objects from K-NN answer returned) objects are partitioned and organized on disk by the similarity metric given query q, the “most-promising” partitions form the candidate set disk storage David Novak Multi-modal Similarity Retrieval DISA Seminar 8 / 17 Existing Solutions Similarity Indexing Similarity Indexing Techniques Metric-based similarity indexing: two decades of research memory structures for precise K-NN search efficient disk-oriented techniques precise and approximate (not all objects from K-NN answer returned) objects are partitioned and organized on disk by the similarity metric given query q, the “most-promising” partitions form the candidate set the candidate set SC is refined by calculating δ(q, x), ∀x ∈ SC disk storage David Novak Multi-modal Similarity Retrieval DISA Seminar 8 / 17 Existing Solutions Similarity Indexing Similarity Indexing Techniques: Metadata Organization Recently, there were proposed a few indexes of different type David Novak Multi-modal Similarity Retrieval DISA Seminar 9 / 17 Existing Solutions Similarity Indexing Similarity Indexing Techniques: Metadata Organization Recently, there were proposed a few indexes of different type memory index that organizes only metadata Novak, D., & Zezula, P. (2014). Rank Aggregation of Candidate Sets for Efficient Similarity Search. In DEXA 2014, Springer. David Novak Multi-modal Similarity Retrieval DISA Seminar 9 / 17 Existing Solutions Similarity Indexing Similarity Indexing Techniques: Metadata Organization Recently, there were proposed a few indexes of different type memory index that organizes only metadata Novak, D., & Zezula, P. (2014). Rank Aggregation of Candidate Sets for Efficient Similarity Search. In DEXA 2014, Springer. similarity index IX candidate set CX ⊆ X k-NN(q) refined answer data storage 1 calculate δ(q,c), c ∊ CX 2 3 David Novak Multi-modal Similarity Retrieval DISA Seminar 9 / 17 Existing Solutions Similarity Indexing Distributed Similarity Indexes Distributed Data Structures for metric-based similarity search data partitioned to nodes according to the metric at query time, query-relevant partitions (nodes) accessed David Novak Multi-modal Similarity Retrieval DISA Seminar 10 / 17 Existing Solutions Similarity Indexing Distributed Similarity Indexes Distributed Data Structures for metric-based similarity search data partitioned to nodes according to the metric at query time, query-relevant partitions (nodes) accessed GHT , VPT , MCAN, M-Chord David Novak Multi-modal Similarity Retrieval DISA Seminar 10 / 17 Existing Solutions Similarity Indexing Distributed Similarity Indexes Distributed Data Structures for metric-based similarity search data partitioned to nodes according to the metric at query time, query-relevant partitions (nodes) accessed GHT , VPT , MCAN, M-Chord C2 p1 p2 2*c 3*c0 c C C1 p 0 0 (mchord) K2 N2 K1 N1K3 N3 N4 K N K4 +2 +16 +8 +4 (a) (b) +1 0 = 32 0 = 32 David Novak Multi-modal Similarity Retrieval DISA Seminar 10 / 17 Existing Solutions Distributed Key-value Stores Current Distributed Stores Currently, many efficient distributed key-value or document stores emerged distributed hash tables objects organized by IDs (ID-object map) quick access to “documents” by IDs secondary indexes on attributes David Novak Multi-modal Similarity Retrieval DISA Seminar 11 / 17 Existing Solutions Distributed Key-value Stores Current Distributed Stores Currently, many efficient distributed key-value or document stores emerged distributed hash tables objects organized by IDs (ID-object map) quick access to “documents” by IDs secondary indexes on attributes David Novak Multi-modal Similarity Retrieval DISA Seminar 11 / 17 Big Data Similarity Retrieval Generic Architecture Generic Architecture key-value store (ID-object) on the whole dataset X worker worker worker worker worker David Novak Multi-modal Similarity Retrieval DISA Seminar 12 / 17 Big Data Similarity Retrieval Generic Architecture Generic Architecture key-value store (ID-object) on the whole dataset X worker worker similarity index IXi field worker worker worker David Novak Multi-modal Similarity Retrieval DISA Seminar 12 / 17 Big Data Similarity Retrieval Generic Architecture Generic Architecture key-value store (ID-object) on the whole dataset X worker worker similarity index IXi field worker worker k-NN(q.field) candidate set CXi ⊆ Xi 1 worker refinement on part of CXi 2 merge partial answers3 refinement on part of CXi 2 David Novak Multi-modal Similarity Retrieval DISA Seminar 12 / 17 Big Data Similarity Retrieval Generic Architecture Generic Architecture key-value store (ID-object) on the whole dataset X worker worker similarity index IXi field inverted file index IXi field2 attribute index IXi field3 similarity index IXj field worker worker k-NN(q.field) candidate set CXi ⊆ Xi 1 worker refinement on part of CXi 2 merge partial answers3 refinement on part of CXi 2 David Novak Multi-modal Similarity Retrieval DISA Seminar 12 / 17 Big Data Similarity Retrieval Generic Architecture System Features Types of queries ID-object query (often useful to initiate k-NN(q) query) attribute-based queries (secondary indexes) key-word (full-text) queries (Lucene-like index) similarity queries (via similarity indexes) David Novak Multi-modal Similarity Retrieval DISA Seminar 13 / 17 Big Data Similarity Retrieval Generic Architecture System Features Types of queries ID-object query (often useful to initiate k-NN(q) query) attribute-based queries (secondary indexes) key-word (full-text) queries (Lucene-like index) similarity queries (via similarity indexes) combined similarity queries (late fusion) K-NN query with attribute filtering distributed re-ranking query answer David Novak Multi-modal Similarity Retrieval DISA Seminar 13 / 17 Big Data Similarity Retrieval Generic Architecture System Features Types of queries ID-object query (often useful to initiate k-NN(q) query) attribute-based queries (secondary indexes) key-word (full-text) queries (Lucene-like index) similarity queries (via similarity indexes) combined similarity queries (late fusion) K-NN query with attribute filtering distributed re-ranking query answer efficient management of multiple data collections X = X1 ∪ X2 ∪ · · · ∪ Xs core key-value store is well horizontally scalable David Novak Multi-modal Similarity Retrieval DISA Seminar 13 / 17 Big Data Similarity Retrieval Specific System Specific System: Large-scale Image Management 100M objects from the CoPhIR dataset (benchmark): { "ID": "002561195", "title": "My wife & daughter on Gold Coast beach", "tags": "summer, beach, ocean, sun, sand, Australia", "mpeg7_scalable_color": "25 36 0 127 69...", "mpeg7_color_layout": "25 41 53 20; 32; -16...", "mpeg7_color_structure": "25 41 53 20; 32;...", "mpeg7_edge_histogram": "5 1 2 3 7 7 3 6...", "mpeg7_homogeneous_texture": "232 201 198 180 201...", "GPS_coordinates": "45.50382, -73.59921", "flickr_user": "david_novak" } David Novak Multi-modal Similarity Retrieval DISA Seminar 14 / 17 Big Data Similarity Retrieval Specific System System Schema Infinispan (Ispn) ID-object store Lucene I tags,title B+ -tree I flickr_user PPP-Codes index I GPS PPP-Codes index I mpeg7 index worker Ispn node worker Ispn node index worker Ispn node worker Ispn node index worker Ispn node index David Novak Multi-modal Similarity Retrieval DISA Seminar 15 / 17 Big Data Similarity Retrieval Specific System Specific System: Demo 20M objects of this type: { "ID": "002561195", "title": "My wife & daughter on Gold Coast beach", "keywords": "summer, beach, ocean, sun, sand, Australia", "DNN_visual_descriptor": [5.431, 0.0042, 0.0, 0.97,... ] } demo David Novak Multi-modal Similarity Retrieval DISA Seminar 16 / 17 Conclusions Conclusions We have proposed and alfa-tested system architecture that provides large-scale similarity search ...on a broad family of data + similarity measures is distributed and horizontally scalable David Novak Multi-modal Similarity Retrieval DISA Seminar 17 / 17 Conclusions Conclusions We have proposed and alfa-tested system architecture that provides large-scale similarity search ...on a broad family of data + similarity measures is distributed and horizontally scalable can manage multi-field data: attribute, keywords, several similarity modalities many variants of multi-modal search queries David Novak Multi-modal Similarity Retrieval DISA Seminar 17 / 17 Conclusions Conclusions We have proposed and alfa-tested system architecture that provides large-scale similarity search ...on a broad family of data + similarity measures is distributed and horizontally scalable can manage multi-field data: attribute, keywords, several similarity modalities many variants of multi-modal search queries Challenges: full implementation and thorough testing the similarity index can be bottleneck =⇒ distribute it David Novak Multi-modal Similarity Retrieval DISA Seminar 17 / 17