Similarity-based event detection system for videos
Vojtěch Zavřel
FI MUNI 2011
1. Motivation
• Growing number of public videos
• Copyright infringement
• Sport streams
• Surveillance
• Video archives
– public
– television
– etc.
1. Video definition
• Different aspects
– image - MPEG-7 descriptors
– text - automatic speech recognition (ASR), captions
– sound
– motion
– temporal
2. Event detection
• Event
– one or more defined aspects occurring in a
video within a time interval
– joined by operators AND, OR, SEQUENCE
• Example
– TV news (by image) AND about Iraq (by text)
AND burning vehicles (by image) AND time
interval < 1 minute (by temporal)
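
One way to read the operators above is as functions over timed detections. The sketch below is only an illustration under that assumption: the `Detection` record, the labels, and the `op_and`, `op_or`, `op_sequence` helpers are hypothetical names, not part of the presented system.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Detection:
    aspect: str   # "image", "text", "sound", "motion"
    label: str    # what was recognised (hypothetical labels)
    start: float  # seconds from the start of the video
    end: float


def select(detections, aspect, label):
    """Pick the detections of one aspect with a given label."""
    return [d for d in detections if d.aspect == aspect and d.label == label]


def op_or(*groups):
    """OR: any detection from any group counts."""
    return [d for group in groups for d in group]


def op_and(*groups, max_span):
    """AND: one detection per group, all inside a window shorter than max_span."""
    matches = []
    for combo in product(*groups):
        span = max(d.end for d in combo) - min(d.start for d in combo)
        if span < max_span:
            matches.append(combo)
    return matches


def op_sequence(*groups):
    """SEQUENCE: one detection per group, each starting after the previous ends."""
    return [combo for combo in product(*groups)
            if all(a.end <= b.start for a, b in zip(combo, combo[1:]))]


detections = [
    Detection("image", "tv_news", 0.0, 50.0),
    Detection("text", "iraq", 10.0, 20.0),
    Detection("image", "burning_vehicles", 30.0, 45.0),
    Detection("image", "burning_vehicles", 300.0, 310.0),
]

# The example event: three aspects joined by AND within less than one minute.
events = op_and(
    select(detections, "image", "tv_news"),
    select(detections, "text", "iraq"),
    select(detections, "image", "burning_vehicles"),
    max_span=60.0,
)
```

Only the first burning-vehicles detection falls inside the one-minute window, so the second one does not produce an event.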
2. Event detection
• Text
• Image
– global – some general features
– local - spatio-temporal detection
• Sound
– music sounds like ...
• Motion detection
– based on background
– picture deformation
3. Current approach
• Common principles
– annotation based systems (manual vs. auto)
• VARS, HASTAC, iVAS
– learning-based systems
• object-based
• concept-based
• Domain
– specific
– general
3. Current approach – domain specific
• Domain specific
– well studied for limited domains such as tennis or
surveillance
– well-known objects
– huge training sets
– specialized structures
– some run in real time
3. Current approach – domain specific
• Surveillance – common use
• Object tracking...
• Face detection...
3. Current approach – general domain
• Domain nonspecific (general)
– based on learning algorithms (training necessary)
– multi-aspect oriented
– good results
– concept-based
– small datasets
3. Strengths & weaknesses
• Good results -> critical applications
• Usable for domain specific
• Combine multiple aspects
• A sufficiently large training set is necessary
– usually described by people
– usually 40% of the whole database
• Often usable only on small datasets
4. Our approach - ViMUF
• Similarity event detection video framework (ViMUF)
– based on similarity principles, not learning
mechanisms
– domain nonspecific
– multi-aspect combination (image, sound,
text, motion, temporal)
– user-supplied aggregation function
– usable on large video datasets
4. System goals
• Similarity based event detection
• Create general interfaces for different
extractors
• UI for defining events based on patterns
and combinations of different aspects with
operators AND, OR and SEQUENCE
• Usable on huge datasets
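
A general extractor interface could look roughly like the sketch below. It is an assumption for illustration only: `Extractor`, `TimedDescriptor`, the two toy extractors and `run_extractors` are hypothetical names, and the toy descriptors (mean intensity, word count) merely stand in for real ones such as MPEG-7 descriptors or ASR output.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TimedDescriptor:
    aspect: str                 # which aspect produced the descriptor
    start: float                # scene start in seconds
    end: float                  # scene end in seconds
    vector: Tuple[float, ...]   # feature vector compared by similarity


class Extractor(ABC):
    """Common interface every aspect-specific extractor implements."""
    aspect: str

    @abstractmethod
    def extract(self, start, end, payload):
        """Return a list of TimedDescriptor objects for one scene."""


class MeanIntensityExtractor(Extractor):
    aspect = "image"

    def extract(self, start, end, payload):
        # payload: pixel intensities of a keyframe (toy stand-in)
        mean = sum(payload) / len(payload)
        return [TimedDescriptor(self.aspect, start, end, (mean,))]


class WordCountExtractor(Extractor):
    aspect = "text"

    def extract(self, start, end, payload):
        # payload: text recognised in the scene (toy stand-in)
        count = float(len(payload.split()))
        return [TimedDescriptor(self.aspect, start, end, (count,))]


def run_extractors(extractors, start, end, payloads):
    """Apply every extractor to the payload of its own aspect."""
    descriptors = []
    for extractor in extractors:
        descriptors.extend(extractor.extract(start, end, payloads[extractor.aspect]))
    return descriptors


descs = run_extractors(
    [MeanIntensityExtractor(), WordCountExtractor()],
    0.0, 4.0,
    {"image": [10, 20, 30], "text": "breaking news from Iraq"},
)
```

With a shared interface, adding a sound or motion extractor would mean adding one class, without changes to the code that calls `run_extractors`.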
4. Lifecycle - extraction
• Split video into scenes
– extract keyframes and add temporal
information (image aspect)
– extract text (OCR, ASR) and add temporal
information (text aspect)
– extract sound (music) descriptors and add
temporal information (sound aspect)
– extract camera and motion vectors (motion
aspect)
– put together the temporal aspect
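
The first step, splitting into scenes and attaching temporal information to a keyframe, can be sketched as follows. The slides do not say how scenes are detected; the sketch assumes a simple threshold on the change between consecutive per-frame signatures and picks the middle frame as keyframe. The function name, the one-number signature, and the threshold value are all hypothetical.

```python
def split_into_scenes(signatures, fps, threshold):
    """Cut a frame sequence into scenes wherever the signature jumps.

    signatures: one number per frame (for example a mean intensity)
    fps:        frames per second, used to convert indices to seconds
    threshold:  minimum jump between neighbouring frames that starts a new scene
    """
    if not signatures:
        return []
    boundaries = [0]
    for i in range(1, len(signatures)):
        if abs(signatures[i] - signatures[i - 1]) > threshold:
            boundaries.append(i)
    boundaries.append(len(signatures))

    scenes = []
    for first, stop in zip(boundaries, boundaries[1:]):
        scenes.append({
            "start": first / fps,                # temporal information in seconds
            "end": stop / fps,
            "keyframe": (first + stop - 1) // 2,  # index of the middle frame
        })
    return scenes


signatures = [10, 11, 10, 80, 82, 81, 81, 20, 21]
scenes = split_into_scenes(signatures, fps=2.0, threshold=30)
```

Each scene record carries the interval that the image, text, sound and motion descriptors of that scene would share.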
4. Query processing
• User-defined function
– Can be used without training set
– Possibility to use multi-aspect queries
– Query function can be defined by user
• Multi Modality Processor
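
A user-defined query function can be pictured as an aggregation over per-aspect distances, with no training step involved. The sketch below assumes Euclidean distance per aspect and a weighted sum as the user's function; `rank`, `weighted_sum`, the weights and the toy database are hypothetical.

```python
import math


def euclidean(a, b):
    """Distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank(query, database, aggregate):
    """Order scenes by the aggregated distance to a multi-aspect query."""
    scored = []
    for scene_id, descriptors in database.items():
        distances = {aspect: euclidean(query[aspect], descriptors[aspect])
                     for aspect in query}
        scored.append((aggregate(distances), scene_id))
    return sorted(scored)


def weighted_sum(distances):
    """Example user-supplied function: image counts more than text."""
    weights = {"image": 0.7, "text": 0.3}
    return sum(weights[aspect] * d for aspect, d in distances.items())


database = {
    "scene_a": {"image": (0.0, 0.0), "text": (1.0,)},
    "scene_b": {"image": (3.0, 4.0), "text": (1.0,)},
    "scene_c": {"image": (0.0, 0.0), "text": (5.0,)},
}
query = {"image": (0.0, 0.0), "text": (1.0,)}

ranking = rank(query, database, weighted_sum)
```

Swapping `weighted_sum` for another function, for example `lambda d: max(d.values())`, changes how the aspects are combined without touching the ranking code.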
6. Conclusion
• Video
– Aspects: image, text, sound, motion,
temporal
– Event detection
• domain specific, nonspecific
• learning mechanisms
• ViMUF
– Different approach based on similarity
– Usable on large datasets