Motion Capture Data Similarity | Classification
PETR ELIÁŠ
03/2015
DISA Laboratory, Faculty of Informatics, Masaryk University

Contents – Motion Capture Data
I. Introduction and Challenges
II. Classification and Current Approaches
III. Our Approach – Principles and Challenges
IV. Our Approach – Results and Future Work

Introduction
Motion Capture (MOCAP) data: a digital approximation of motions carried out by observed subjects, captured for further inspection and applications.
• Digital approximation – an (x, y, z) coordinate for each tracked joint in each frame (up to 120 fps)
• Motions such as gait (walking), facial expressions, interactions, whole-body actions
• Observed subjects are, so far, commonly individual humans
• Captured by devices based on various technologies (Kinect, OptiTrack, Xsens, …)
• Inspected for analysis, action detection, action recognition, classification, reconstruction
• Applications in medicine, sports, security, entertainment (movies, games), robotics, …

General challenges
• Too much information on input (complexity)
• High cost of processing the original data (efficiency)
• Feature extraction and dimensionality reduction (effectiveness)
• Various scenarios, lengths of motions, and data sets (adaptability)
• Applications are highly scenario-dependent: there is no general definition of MOCAP data similarity and no accepted universal solution for action recognition or classification

Motion Data Classification
Identifying the category (or categories) of an observed instance on the basis of observations whose category membership is known.

Challenges
• Different actions are performed differently by different actors
• Scope ranging from micro-gestures (facial expressions) to complex exercises (dancing)
• Relative vs. absolute movement (jogging vs. jogging in place)
• Rotation of the actor (running straight vs. running in a circle)
• Various frame rates, body sizes, data quality, numbers of tracked joints, …

Classification approaches
Features (generally simple): relative distances or angles between joints, most informative joints, velocity changes, absolute coordinates, space-time occupancy, skeletal quads, covariance of 3D joints, a flexible dictionary of action primitives, …
combined with
Classifiers (generally complex):
• Distance-based: Dynamic Time Warping, k-NN, …
• Machine-learning-based: Support Vector Machines, Neural Networks, Hidden Markov Models, Boltzmann machines, …
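To illustrate the distance-based family named above, the following is a minimal sketch of 1-NN classification under Dynamic Time Warping; it assumes each motion arrives as a list of flat pose vectors, and the Euclidean pose distance and all function names are illustrative choices, not taken from any of the cited systems.

```python
# A minimal DTW-based 1-NN sketch. Each motion is a list of pose vectors
# (concatenated joint coordinates); names here are illustrative only.
import math

def pose_distance(p, q):
    """Euclidean distance between two poses given as flat coordinate lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def dtw_distance(seq_a, seq_b):
    """Classic O(n*m) dynamic-programming DTW over two pose sequences."""
    n, m = len(seq_a), len(seq_b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = pose_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_a
                                 cost[i][j - 1],      # stretch seq_b
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def classify_1nn(query, labelled_motions):
    """Return the label of the (sequence, label) pair closest to the query."""
    return min(labelled_motions, key=lambda lm: dtw_distance(query, lm[0]))[1]
```

Such a warping distance absorbs speed differences between performances, which is why DTW reappears later as a planned baseline in the future work.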
Our approach – Main idea
1) Find an effective transformation from (dynamic) motion capture data into (static) images.
2) Classify images based on their visual similarity to other images, using known approaches (a k-NN classifier on Caffe descriptors).

[Figure: a query motion matched against "stand up" and "cartwheel" motion images via their Caffe descriptors, with similarity weights of 30 % and 70 %.]

Our approach – Motivation
• Visualization of motion data gives humans a better understanding than a set of high-dimensional vectors
• Comparing the visual similarity of images is a well-established concept: it achieves high precision and many existing techniques can be employed
• Instead of finding a complex solution to a problem, it is sometimes easier to reduce the problem to another one that already has a known solution
• Universality (scenario independence) of the approach: select a transformation function that visually differentiates the target classification categories

Our approach – Process
A four-step pipeline: CONVERT motions to images, EXTRACT descriptors, IMPORT them into a metric space, and CLASSIFY (a sketch of the classification step follows this list).
• MOCAP data: 1464 motions, 120 fps, 31 joints, 15 categories (rotate arms, punch, …)
• Images: 1 motion = 1 image; width = number of frames, height = number of joints
• Caffe descriptors: a convolutional neural network trained on a set of 1.2 million images (mostly photographs); the output is a 4096-dimensional vector, extracted in about 1 s per image
• Classifier: 1-NN and weighted k-NN over metric-space instances, using the MESSIF framework
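To make the final pipeline step concrete, here is a minimal sketch of weighted k-NN voting over precomputed descriptors; it assumes NumPy arrays and a Euclidean metric, and the inverse-distance weighting shown is one common choice rather than the exact scheme used in the experiments.

```python
# A minimal weighted k-NN sketch over precomputed 4096-dimensional descriptors.
# Assumes a Euclidean metric; the inverse-distance weighting is one common
# choice, not necessarily the scheme used in the presented experiments.
import numpy as np

def weighted_knn(query, descriptors, labels, k=2):
    """Return the category whose k nearest descriptors carry the most weight."""
    dists = np.linalg.norm(descriptors - query, axis=1)  # distance to every instance
    nearest = np.argsort(dists)[:k]                      # indices of the k closest
    votes = {}
    for i in nearest:
        weight = 1.0 / (dists[i] + 1e-9)                 # closer neighbours vote more
        votes[labels[i]] = votes.get(labels[i], 0.0) + weight
    return max(votes, key=votes.get)
```

With k = 1 this degenerates to the plain 1-NN classifier; the results reported below use k = 2.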
Our approach – Motions as images
Every motion is a time series of the (x, y, z) coordinates of all tracked joints, so it can be drawn as an image with time on the horizontal axis and joints on the vertical axis. For example, the color of pixel (140, 23) is given by RGB(X, Y, Z), where X, Y, Z are the coordinates of joint j23 at time t140, normalized* over the whole dataset to the range (0, 255).
*The minimum over all x, y, z coordinates, over all joints, poses, and sequences, maps to 0; the maximum maps to 255.

[Figure: example motion images for the categories rotate arms, throw right hand, exercise, kick, and cartwheel.]

Our approach – Challenges
• Notion of time: various speeds of performance, various lengths of actions
• Normalization: initial rotation of the subject (rotate by hips, in the first frame or all frames), centering in space (put the root joint at (0, 0, 0), in the first frame or all frames), human skeleton size (infant vs. adult, bone-length normalization), range normalization (into RGB or another target space)
• Segmentation: action recognition in longer sequences

Normalization (a sketch of all three steps follows this list)
I. Pose centering: translate the root joint to (0, 0, 0).
II. Pose rotation by angle φ: rotation around the y-axis, where φ is the angle between the z-axis and the straight line connecting the left and right hip in the y-projected 2D space (x, z).
III. Coordinate value normalization: reduction to a desired range such as RGB or (0, 1).
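A minimal sketch of normalization steps I–III and the subsequent image conversion, assuming a motion arrives as a NumPy array of shape (frames, joints, 3) with the root joint at index 0; the hip-joint indices are passed in by the caller, all names are illustrative, and in practice the minimum and maximum in step III are taken over the whole dataset rather than a single motion.

```python
# A minimal sketch of normalization (I-III) and motion-to-image conversion.
# A motion is a (frames, joints, 3) array; the root joint sits at index 0.
# Hip indices and all names are illustrative assumptions.
import numpy as np

def normalize_pose(frame, left_hip, right_hip):
    """Steps I and II: center the root joint at the origin, then rotate
    around the y-axis so the hip line is consistently aligned."""
    frame = frame - frame[0]                  # I. root joint to (0, 0, 0)
    hip = frame[right_hip] - frame[left_hip]
    phi = np.arctan2(hip[0], hip[2])          # II. angle to the z-axis in the (x, z) plane
    c, s = np.cos(-phi), np.sin(-phi)
    rot_y = np.array([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])
    return frame @ rot_y.T

def motion_to_image(motion, left_hip, right_hip):
    """Step III plus layout: map coordinates to (0, 255) and arrange them
    so that rows are joints, columns are frames, channels are (x, y, z)."""
    posed = np.stack([normalize_pose(f, left_hip, right_hip) for f in motion])
    lo, hi = posed.min(), posed.max()         # dataset-wide extremes in practice
    scaled = (posed - lo) / (hi - lo) * 255   # III. reduce to the RGB range
    return scaled.astype(np.uint8).transpose(1, 0, 2)
```

The transpose at the end matches the layout from the process slide: image height equals the number of joints and image width equals the number of frames.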
Results – Confusion matrix
hdm05 | 1464 motions | 15 categories | 2-NN based classification | 93.17 % precision
Rows are true categories, columns are predicted categories (in %), #Ns is the number of instances per category.

ID  MOVE          1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   #Ns
 1  cartwheel   100    0    0    0    0    0    0    0    0    0    0    0    0    0    0     6
 2  grabDepR      0   96    0    4    0    0    0    0    0    0    0    0    0    0    0   105
 3  kick          0    0   98    2    0    0    0    0    0    0    0    0    0    0    0    49
 4  move          0  0.2    0   93    0    2    0    0    0  0.5    0    0    0    0    4   430
 5  punch         0    0    0    0  100    0    0    0    0    0    0    0    0    0    0    48
 6  rotateArms    0    0    0   11    0   89    0    0    0    0    0    0    0    0    0    46
 7  sitLieDown    0    2    0    2    0    0   95    0    0    0    0    0    0    0    0    43
 8  standUp       0    0    0    2    0    0    0   95    0    0    0    0    0    0    2    43
 9  throwR        0    0    0    4    0    0    0    0   96    0    0    0    0    0    0    23
10  jump          0    0    0   12    0    0    0    0    0   84    4    0    0    0    0    25
11  hopOneLeg     0    0    0    6    0    0    0    0    0    0   94    0    0    0    0    18
12  neutral       0    0    0    0    0    0    0    0    0    0    0   83    1    0   16    75
13  tpose         0    0    0    0    0    0    0    1    0    0    0    2   98    0    0   198
14  exercise      0    0    0   11    0    0    0    0    0    5    0    0    0   84    0    19
15  turn          0  0.3    0    2    0    0    0    0    0    0    0    7    0    0   91   336

Comparison with other approaches
[Figure: comparison of reported precisions, reproduced from Luo, Wang, & Qi (2014), doi:10.1016/j.patrec.2014.03.024.]

Summary
Advantages
• Differences between motions can be observed directly by visual comparison
• An interesting approach combining known technologies to solve a challenging problem
• Potential for a scenario-independent solution
• Sub-motion and repetitive-action recognition using the neural network
• Quite robust and tolerant to varying motion lengths (even 50× resized images retain similar precision)

Disadvantages
• No solution for segmentation
• Not suitable for online action recognition
• Computing the image descriptors is computationally and time demanding (on the order of minutes)

Future work
• Action recognition based on segmentation
• Motion classification using a convolutional neural network trained on a subset of motion images or, better, trained directly on MOCAP data
• Comparison with the DTW approach (on centered, rotated, normalized poses)
• Optimizing the speed of feature extraction – the Caffe descriptor is the current bottleneck

Sources
Ofli, F., Chaudhry, R., Kurillo, G., Vidal, R., & Bajcsy, R. (2014). Sequence of the Most Informative Joints (SMIJ): A new representation for human skeletal action recognition. Journal of Visual Communication and Image Representation, 25(1), 24–38. doi:10.1016/j.jvcir.2013.04.007
Poppe, R., Van Der Zee, S., Heylen, D. K. J., & Taylor, P. J. (2014). AMAB: Automated measurement and analysis of body motion. Behavior Research Methods, 46, 625–633. doi:10.3758/s13428-013-0398-y
Chen, X., & Koskela, M. (2013). Classification of RGB-D and motion capture sequences using extreme learning machine. Image Analysis, 640–651. http://link.springer.com/chapter/10.1007/978-3-642-38886-6_60
Luo, J., Wang, W., & Qi, H. (2014). Spatio-temporal feature extraction and representation for RGB-D human action recognition. Pattern Recognition Letters. doi:10.1016/j.patrec.2014.03.024
Vieira, A. W., Nascimento, E. R., Oliveira, G. L., Liu, Z., & Campos, M. F. M. (2012). STOP: Space-Time Occupancy Patterns for 3D action recognition from depth map sequences. Lecture Notes in Computer Science, Vol. 7441, 252–259. doi:10.1007/978-3-642-33275-3_31
Evangelidis, G., Singh, G., & Horaud, R. (2014). Skeletal Quads: Human action recognition using joint quadruples. doi:10.1109/ICPR.2014.772
Hussein, M. E., Torki, M., Gowayyed, M. A., & El-Saban, M. (2013). Human action recognition using a temporal hierarchy of covariance descriptors on 3D joint locations. IJCAI International Joint Conference on Artificial Intelligence, 2466–2472.