Motion Capture Data Similarity | Classification
Petr Eliáš, 03/2015, DISA Laboratory, Faculty of Informatics, Masaryk University

Contents – Motion Capture Data
I. Introduction and Challenges
II. Classification and Current Approaches
III. Our Approach – Principles and Challenges
IV. Our Approach – Results and Future Work

Introduction
Motion Capture (MOCAP) Data: a digital approximation of motions carried out by observed subjects, captured for further inspection and applications.
• Digital approximation: an (x, y, z) coordinate for each tracked joint in each frame (< 120 fps)
• Motions such as gait (walking), facial expressions, interactions, whole-body actions
• Observed subjects are so far most commonly individual humans
• Captured by devices based on various technologies (Kinect, OptiTrack, Xsens, …)
• Inspected for analysis, action detection, action recognition, classification, reconstruction
• Applications in medicine, sports, security, entertainment (movies, games), robotics, …

General Challenges
• Too much information on input (complexity)
• High cost of processing the original data (efficiency)
• Feature extraction and dimension reduction (effectiveness)
• Various scenarios, various lengths of motions, various data sets (adaptability)
• Applications are highly scenario-dependent: there is no general definition of MOCAP data similarity and no accepted universal solution for action recognition or classification

Motion Data Classification
Identifying the category (or categories) of an observed instance on the basis of observations whose category membership is known.
Challenges:
• Different actions are performed differently by different actors
• Scope ranging from micro-gestures (facial expressions) to complex exercises (dancing)
• Relative vs. absolute moves (jog vs. jog in place)
• Rotation of the actor (run vs. run in a circle)
• Various frame rates, body sizes, data quality, numbers of tracked joints, …

Classification Approaches
Features (generally simple): relative distances or angles between joints, most informative joints, velocity changes, absolute coordinates, space-time occupancy, skeletal quads, covariance of 3D joints, a flexible dictionary of action primitives, …
combined with a Classifier (generally complex):
• distance-based: Dynamic Time Warping (sketched below), k-NN, …
• machine-learning-based: Support Vector Machines, Neural Networks, Hidden Markov Models, Boltzmann machines, …
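As an illustration of the distance-based family, here is a minimal sketch of Dynamic Time Warping, assuming each motion is a NumPy array of shape (num_frames, num_joints * 3) holding flattened joint coordinates; the function name and representation are illustrative, not taken from the slides.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic O(len(a) * len(b)) dynamic-programming DTW with a
    Euclidean pose-to-pose distance."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])
```

A 1-NN classifier then simply assigns a query motion the label of its DTW-nearest training motion.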
Our Approach – Main Idea
1) Find an effective transformation from (dynamic) motion capture data into (static) images.
2) Classify each image based on its visual similarity to others, using known approaches (a k-NN classifier on Caffe descriptors).

[Figure: steps 1) and 2) illustrated with "stand up" and "cartwheel" motion images, their Caffe descriptors, and match weights of 30 % and 70 %.]

Our Approach – Motivation
• Visualizing motion data gives humans a better understanding than a set of high-dimensional vectors does
• Comparing the visual similarity of images is a well-established concept nowadays: it achieves high precision and many existing techniques can be employed
• Instead of finding a complex solution to a problem, it is sometimes easier to reduce the problem to another one that already has a known solution
• Universality (scenario independence) of this approach: it suffices to select a proper transformation function that visually differentiates the target classification categories

Our Approach – Process
CONVERT – MOCAP data (HDM05: 1464 motions, 120 fps, 31 joints, 15 categories such as rotate arms, punch, …) into images: 1 motion = 1 image; width = number of frames, height = number of joints
EXTRACT – Caffe descriptors: a convolutional neural network trained on a set of 1.2M images (mostly photographs); the output is a 4096-dimensional vector; extraction takes about 1 s per image
IMPORT, CLASSIFY – classifier: 1-NN and weighted k-NN over metric-space instances, using the MESSIF framework (a sketch of this stage follows the Future Work slide)

Our Approach – Motions as Images
Every motion is a time series of the (x, y, z) coordinates of all tracked joints. The resulting image has time on the horizontal axis and joints on the vertical axis. For example, the color of pixel (140, 23) is given by RGB(X, Y, Z), where X, Y, Z are the coordinates of joint j23 at time t140, normalized* over the whole dataset to the range (0, 255).
*The minimum over all x, y, z coordinates, over all joints, all poses, and all sequences maps to 0; the maximum maps to 255.

[Figure: example motion images for the categories rotate arms, throw right hand, exercise, kick, and cartwheel.]

Our Approach – Challenges
• Notion of time
  • various speeds of performance
  • various lengths of actions
• Normalization
  • initial rotation of the subject (rotate by the hips; by the first frame or by all frames)
  • centering in space (put the root joint at (0, 0, 0); by the first frame or by all frames)
  • human skeleton size (infant vs. adult; bone-size normalization)
  • range normalization (into RGB or another target space)
• Segmentation
  • action recognition in longer sequences

Normalization
I. Pose centering: the root joint is moved to (0, 0, 0).
II. Pose rotation by angle φ: rotation around the y-axis, where φ is the angle between the z-axis and the straight line connecting the left and right hips in the y-projected 2D space (x, z).
III. Coordinate value normalization: reduction to a desired range such as RGB or (0, 1).
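A minimal sketch of normalization steps I–III combined with the motion-to-image conversion described above, assuming a motion is a NumPy array of shape (T, n, 3): T frames, n joints, (x, y, z) per joint. The joint indices ROOT, LEFT_HIP, RIGHT_HIP and the per-frame rotation variant are assumptions, not details from the slides.

```python
import numpy as np
from PIL import Image

ROOT, LEFT_HIP, RIGHT_HIP = 0, 1, 6  # hypothetical indices for the skeleton

def normalize_pose(pose: np.ndarray) -> np.ndarray:
    """I. center the root joint at (0, 0, 0); II. rotate around the y-axis
    so the hip line aligns with the z-axis (per-frame variant)."""
    p = pose - pose[ROOT]                       # I. pose centering
    hip = p[RIGHT_HIP] - p[LEFT_HIP]
    phi = np.arctan2(hip[0], hip[2])            # angle to the z-axis in (x, z)
    c, s = np.cos(-phi), np.sin(-phi)
    rot_y = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return p @ rot_y.T                          # II. pose rotation

def motion_to_image(motion: np.ndarray, lo: float, hi: float) -> Image.Image:
    """III. map coordinates to RGB; lo/hi are the dataset-wide minimum and
    maximum coordinate values (computed over all sequences)."""
    norm = np.stack([normalize_pose(p) for p in motion])
    rgb = np.clip(255 * (norm - lo) / (hi - lo), 0, 255).astype(np.uint8)
    # width = number of frames, height = number of joints
    return Image.fromarray(np.ascontiguousarray(rgb.transpose(1, 0, 2)), "RGB")
```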
Results – Confusion Matrix
HDM05 | 1464 motions | 15 categories | 1-NN classification | 93.17 % precision
Each row shows how motions of one category were classified (values in %; columns refer to category IDs); #Ns is the number of motions per category.

| ID | Move       |   1 |   2 |  3 |  4 |   5 |  6 |  7 |  8 |  9 |  10 | 11 | 12 | 13 | 14 | 15 | #Ns |
|----|------------|-----|-----|----|----|-----|----|----|----|----|-----|----|----|----|----|----|-----|
| 1  | cartwheel  | 100 |   0 |  0 |  0 |   0 |  0 |  0 |  0 |  0 |   0 |  0 |  0 |  0 |  0 |  0 |   6 |
| 2  | grabDepR   |   0 |  96 |  0 |  4 |   0 |  0 |  0 |  0 |  0 |   0 |  0 |  0 |  0 |  0 |  0 | 105 |
| 3  | kick       |   0 |   0 | 98 |  2 |   0 |  0 |  0 |  0 |  0 |   0 |  0 |  0 |  0 |  0 |  0 |  49 |
| 4  | move       |   0 | 0.2 |  0 | 93 |   0 |  2 |  0 |  0 |  0 | 0.5 |  0 |  0 |  0 |  0 |  4 | 430 |
| 5  | punch      |   0 |   0 |  0 |  0 | 100 |  0 |  0 |  0 |  0 |   0 |  0 |  0 |  0 |  0 |  0 |  48 |
| 6  | rotateArms |   0 |   0 |  0 | 11 |   0 | 89 |  0 |  0 |  0 |   0 |  0 |  0 |  0 |  0 |  0 |  46 |
| 7  | sitLieDown |   0 |   2 |  0 |  2 |   0 |  0 | 95 |  0 |  0 |   0 |  0 |  0 |  0 |  0 |  0 |  43 |
| 8  | standUp    |   0 |   0 |  0 |  2 |   0 |  0 |  0 | 95 |  0 |   0 |  0 |  0 |  0 |  0 |  2 |  43 |
| 9  | throwR     |   0 |   0 |  0 |  4 |   0 |  0 |  0 |  0 | 96 |   0 |  0 |  0 |  0 |  0 |  0 |  23 |
| 10 | jump       |   0 |   0 |  0 | 12 |   0 |  0 |  0 |  0 |  0 |  84 |  4 |  0 |  0 |  0 |  0 |  25 |
| 11 | hopOneLeg  |   0 |   0 |  0 |  6 |   0 |  0 |  0 |  0 |  0 |   0 | 94 |  0 |  0 |  0 |  0 |  18 |
| 12 | neutral    |   0 |   0 |  0 |  0 |   0 |  0 |  0 |  0 |  0 |   0 |  0 | 83 |  1 |  0 | 16 |  75 |
| 13 | tpose      |   0 |   0 |  0 |  0 |   0 |  0 |  0 |  1 |  0 |   0 |  0 |  2 | 98 |  0 |  0 | 198 |
| 14 | exercise   |   0 |   0 |  0 | 11 |   0 |  0 |  0 |  0 |  0 |   5 |  0 |  0 |  0 | 84 |  0 |  19 |
| 15 | turn       |   0 | 0.3 |  0 |  2 |   0 |  0 |  0 |  0 |  0 |   0 |  0 |  7 |  0 |  0 | 91 | 336 |

Other Approach Comparison
[Figure: comparison with the results reported in Luo, J., Wang, W., & Qi, H. (2014). Spatio-temporal feature extraction and representation for RGB-D human action recognition. Pattern Recognition Letters. doi:10.1016/j.patrec.2014.03.024]

Summary
Advantages:
• Differences between motions can be observed directly by visual comparison
• An interesting approach that combines known technologies to solve a challenging problem
• Potential for a scenario-independent solution
• Sub-motion and repetitive-action recognition using NN
• Quite robust and tolerant to various lengths (even images resized by a factor of 50 still obtain similar precision)
Disadvantages:
• No solution for segmentation
• Not suitable for online action recognition
• Computing the image descriptors is computationally and time demanding (on the order of minutes)

Future Work
• Action recognition based on segmentation
• Motion classification using a convolutional neural network trained on a subset of motion images or, better, a convolutional neural network trained directly on MOCAP data
• Comparison with a DTW approach (centered, rotated, normalized poses)
• Optimizing the speed of feature extraction: the Caffe descriptor is the current bottleneck
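To make the classification stage of the pipeline concrete, here is a minimal sketch of 1-NN / weighted k-NN over precomputed 4096-dimensional descriptor vectors. The arrays, function name, and inverse-distance weighting are illustrative assumptions; the original work runs this stage in the MESSIF framework.

```python
import numpy as np
from collections import defaultdict

def knn_classify(query: np.ndarray, train: np.ndarray,
                 labels: list[str], k: int = 1) -> str:
    """Return the label with the largest inverse-distance weight among
    the k nearest training descriptors (k = 1 gives plain 1-NN)."""
    dists = np.linalg.norm(train - query, axis=1)        # Euclidean distances
    votes = defaultdict(float)
    for idx in np.argsort(dists)[:k]:
        votes[labels[idx]] += 1.0 / (dists[idx] + 1e-9)  # weighted vote
    return max(votes, key=votes.get)
```

With k = 1 this reduces to the 1-NN classifier used for the confusion matrix above.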
Sources
Ofli, F., Chaudhry, R., Kurillo, G., Vidal, R., & Bajcsy, R. (2014). Sequence of the Most Informative Joints (SMIJ): A new representation for human skeletal action recognition. Journal of Visual Communication and Image Representation, 25(1), 24–38. doi:10.1016/j.jvcir.2013.04.007
Poppe, R., Van Der Zee, S., Heylen, D. K. J., & Taylor, P. J. (2014). AMAB: Automated measurement and analysis of body motion. Behavior Research Methods, 46, 625–633. doi:10.3758/s13428-013-0398-y
Chen, X., & Koskela, M. (2013). Classification of RGB-D and motion capture sequences using extreme learning machine. Image Analysis, 640–651. http://link.springer.com/chapter/10.1007/978-3-642-38886-6_60
Luo, J., Wang, W., & Qi, H. (2014). Spatio-temporal feature extraction and representation for RGB-D human action recognition. Pattern Recognition Letters. doi:10.1016/j.patrec.2014.03.024
Vieira, A. W., Nascimento, E. R., Oliveira, G. L., Liu, Z., & Campos, M. F. M. (2012). STOP: Space-Time Occupancy Patterns for 3D action recognition from depth map sequences. In Lecture Notes in Computer Science (Vol. 7441, pp. 252–259). doi:10.1007/978-3-642-33275-3_31
Evangelidis, G., Singh, G., & Horaud, R. (2014). Skeletal Quads: Human action recognition using joint quadruples. doi:10.1109/ICPR.2014.772
Hussein, M. E., Torki, M., Gowayyed, M. A., & El-Saban, M. (2013). Human action recognition using a temporal hierarchy of covariance descriptors on 3D joint locations. IJCAI International Joint Conference on Artificial Intelligence, 2466–2472.

Our approach formally
We denote the $t$-th body pose in a motion sequence as a vector
$$p^t = (j_1^t, \dots, j_n^t), \qquad t \in \{1, \dots, T\},$$
for a recording with $T$ frames. Each component $j_i^t$ of the vector $p^t$ corresponds to the position measurement of joint $i \in \{1, \dots, n\}$ and is denoted by a triplet $(x, y, z)$.

Our approach formally (2)
Let $s = (p^{t_1}, p^{t_2}, \dots, p^{t_T})$ be a sequence of poses constituting some motion, and let $img_{\gamma(\alpha, \beta)}$ with $\alpha \in \{1, \dots, maxWidth\}$, $\beta \in \{1, \dots, maxHeight\}$, $\gamma(\alpha, \beta) \in RGB$ be an image of size $maxWidth \times maxHeight$, where $\gamma(\alpha, \beta)$ specifies how to color the pixel at position $(\alpha, \beta)$. Finally, we seek a transformation function
$$\varphi\colon TIME \times JOINTS \times \mathbb{R}^3 \to \mathbb{N}^2 \times RGB$$
such that $\varphi(s) = img_{\gamma(\alpha, \beta)}$.
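A concrete instance of $\gamma$ consistent with the pixel-coloring rule on the Motions as Images slide (a sketch; the floor operation and the exact linear scaling are assumptions, and $m$, $M$ denote the dataset-wide minimum and maximum coordinate values over all sequences $s$):

```latex
\gamma(t, i) =
\left( \left\lfloor 255\,\frac{x_i^t - m}{M - m} \right\rfloor,\;
       \left\lfloor 255\,\frac{y_i^t - m}{M - m} \right\rfloor,\;
       \left\lfloor 255\,\frac{z_i^t - m}{M - m} \right\rfloor \right),
\qquad
m = \min_{s,\,t,\,i}\{x_i^t, y_i^t, z_i^t\},\quad
M = \max_{s,\,t,\,i}\{x_i^t, y_i^t, z_i^t\}
```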