Research in Processing Human Motion from Video Data using Deep Learning
Jan Sedmidubský
Department of Machine Learning and Data Processing, Masaryk University, Brno, Czechia
Laboratory of Data Intensive Systems and Applications – disa.fi.muni.cz
[Figure: video of "persons crossing the street" processed by a deep learning model]

Digitization of human motion (2/26)
• Skeleton-data representation
  • Simplified spatio-temporal representation of human motion
  • Sequence of 3D skeletons ~ a set of 3D trajectories of key body joints
  • Better structured and easier to store than the video-based representation
[Figure: video-based representation vs. skeleton-based representation]
Source: https://blog.usejournal.com/3d-human-pose-estimation-ce1259979306

Skeleton data (3/26)
• Skeleton-based representation
  • Sequence S = (P1, …, Pn) of skeleton poses P1, …, Pn
  • Pose Pi = (c1, …, cm) consists of 2D/3D coordinates c1, …, cm of selected key body points that usually correspond to significant joints
  • Bones are just "artificial lines" between pairs of key points
[Figure: matrix representation of 3D skeleton data – one column per pose P1, …, Pn, with x/y/z rows for joints c1, …, cm; body model with m = 31 joints; trajectory of joint c1]

Capturing technologies (4/26)
• Acquisition of skeleton data
  • Optical sensors (e.g., Vicon)
  • Inertial sensors (e.g., Xsens)
  • RGB + depth sensors (e.g., Kinect, Xtion)
  • Ordinary video camera with pose estimation (e.g., HRNet, STAF, XNect)

Type                | Technologies               | Sensors | Joints | Framerate (fps) | Error | Cost | Mobility | Invasivity
Optical sensors     | Vicon, OptiTrack, Qualisys | 10–40   | 22–32  | 120–420         | mm    | $$$  | –        | Markers
Inertial sensors    | Xsens, Vicon               | ~20     | ~20    | ~120            | mm–cm | $$   | ✓        | Sensors
RGB + depth sensors | Kinect, Xtion              | 3       | 25     | ~30             | >cm   | $    | –        | –
Pose estimation     | OpenPose, STAF, XNect      | 1       | 14–16  | ~video          | >cm   | –    | ✓        | –

[Figure: timeline of capturing technologies, 2000–2023]

Pose estimation example (5/26)

Great application potential (6/26)
• A wide variety of possible application domains
  • Sports – digital referees assessing the quality of performance
  • Virtual reality – recognizing player movements in real time
  • Smart cities – detecting falls of (elderly) people
  • Healthcare – evaluating the rehabilitation progress remotely
Sources: https://blog.usejournal.com/3d-human-pose-estimation-ce1259979306, https://www.youtube.com/watch?v=5cI-JibDEMA

Example application – analysis of speed climbing performances (7/26)

Research objectives (8/26)
• Research objective
  • Effective and efficient content-based access to skeleton data to make them "findable" and thus reusable
• Content-based access operations:
  • Searching | Subsequence searching
  • Action recognition (classification) | Action detection
  • Motion generation (synthesis)
• Challenges: data complexity | similarity-based comparison | data volume
[Figure: "Similar?" – comparing a query motion against a motion database (DB)]
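To make the skeleton-based representation above concrete before turning to similarity models, here is a minimal sketch of how a sequence S = (P1, …, Pn) with m joints per pose can be held in memory. NumPy, the array shapes, and the example values of n and m are illustrative assumptions of this sketch, not prescribed by the slides.

```python
import numpy as np

# A motion S = (P1, ..., Pn): n poses, each with m joints given as 3D (x, y, z) coordinates.
# Shape (n, m, 3); n = 150 frames and m = 31 joints are illustrative values only.
n, m = 150, 31
S = np.random.rand(n, m, 3)          # placeholder for real capture data

P1 = S[0]                            # first pose: an (m, 3) matrix of joint coordinates
trajectory_c1 = S[:, 0, :]           # 3D trajectory of joint c1 across all n poses, shape (n, 3)

# The per-motion matrix view used on the slides (x/y/z rows per joint, one column per pose):
matrix_view = S.transpose(1, 2, 0).reshape(3 * m, n)   # shape (3m, n)
```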
Similarity models – handcrafted (9/26)
• Similarity of actions
  • Actions (sequences of poses) have different lengths → similarity can be determined by time-warping functions
  • Dynamic Time Warping (DTW) – quadratic time complexity O(n ∙ n')
  • Uniform Time Warping (UTW) – linear time complexity O(n + n')
  • Time warping requires a pose-based similarity function: the action distance dist(S, S') → ℝ₀⁺ (computed by DTW or UTW) is built from a pose distance poseDist(Pi, P'j) → ℝ₀⁺
[Figure: time-warping alignment between S = (P1, …, Pn) and S' = (P'1, …, P'n'), e.g., the matched pair poseDist(P2, P'7)]

Similarity models – handcrafted (10/26)
• Similarity of poses on raw joint coordinates
  • Optional pre-processing step – normalization of coordinates:
    • Normalization of position, orientation, and skeleton size
  • Several possibilities, e.g., the sum of the Euclidean distances between corresponding 3D joint coordinates:
    poseDist(Pi, P'j) = Σ_{c=1}^{m} ‖Pi[c] − P'j[c]‖, where Pi[c] denotes the 3D coordinates of the c-th joint of pose Pi
[Figure: corresponding joints of poses Pi and P'j, e.g., the 7th joints Pi[7] and P'j[7]]

Similarity models – deep learning (11/26)
• Handcrafted models hardly capture the semantics of skeleton data
• A variety of deep-learning architectures for learning semantics:
  • 2D/3D convolutional neural networks (CNNs)
  • Recurrent neural networks (RNNs)
  • Graph convolutional networks (GCNs)
  • Transformers
• Learning:
  • Supervised (~classifiers)
  • Self-supervised
  • Unsupervised

Similarity models – motion images + CNN (12/26)
• Input skeleton data transformed into a 2D motion image
• Training a CNN to learn semantic motion features (fixed-size vectors)
• Example of training – training for classification in a supervised way
• Similarity – features efficiently compared by the Euclidean/cosine distance
[Figure: 3D skeleton data → motion image → CNN → r output motion classes C1, …, Cr (e.g., "walking", 90%); the last hidden layer ~ semantic motion feature]

Similarity models – motion images + CNN (13/26)
• Construction of motion images:
  • Cover all skeleton poses by a virtual 3D cube
  • Split the cube into 256×256×256 cells and assign a color to each cell based on the RGB color space (16.8M colors)
    => a 3D joint position is approximated by a specific color
    => spatially similar joint coordinates get similar colors
  • Each joint coordinate ci is mapped to a color by min–max scaling within the cube:
    RGB(ci) = 255 / (max − min) ∙ (ci[x] − min, ci[y] − min, ci[z] − min)
[Figure: matrix representation of 3D skeleton data P1, …, Pn]

Similarity models – motion images + CNN (14/26)
• Transforming skeleton data into a 2D motion image
• Training a CNN using 2D motion images
• Resizing motion images to a fixed size (e.g., 224×224 pixels)
• Temporal deformations – slower/faster action executions become very similar
[Figure: motion image with joints on the vertical axis (torso, left/right hand, left/right leg) and time on the horizontal axis]
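As a minimal sketch of the motion-image construction described above: each 3D joint coordinate is mapped to an RGB color with the min–max formula from slide 13, producing a joints × time color image that is then resized to a fixed CNN input size. NumPy and the nearest-neighbor resize are assumptions of this sketch, not necessarily the pipeline used in the original work.

```python
import numpy as np

def motion_image(S, out_size=224):
    """S: skeleton sequence of shape (n, m, 3) -> uint8 motion image of shape (out_size, out_size, 3)."""
    lo, hi = S.min(), S.max()                       # the virtual 3D cube covering all poses
    img = (255.0 / (hi - lo)) * (S - lo)            # RGB(ci) = 255/(max - min) * (ci - min), per coordinate
    img = img.transpose(1, 0, 2)                    # rows ~ joints, columns ~ time, channels ~ x/y/z colors
    # Nearest-neighbor resize to a fixed CNN input size (an assumption; any image resize would do).
    rows = np.linspace(0, img.shape[0] - 1, out_size).round().astype(int)
    cols = np.linspace(0, img.shape[1] - 1, out_size).round().astype(int)
    return img[np.ix_(rows, cols)].astype(np.uint8)

# Usage: feed motion_image(S) to a 2D CNN for supervised classification training,
# then take the last hidden layer as the semantic motion feature.
```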
Similarity models – RNNs (15/26)
• An RNN with GRU/LSTM cells – suitable for learning temporal data
• Number of states/cells corresponds to the number of poses
• Semantic feature vector – output of the last cell (h_{t+1})
• Size of each state hi is a user-defined parameter (e.g., 512 dimensions)
• Features compared by the Euclidean/cosine distance
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Similarity models – RNNs (16/26)
• Examples of training:
  • Supervised (for classification)
  • Self-supervised using pairs of similar/dissimilar actions

Searching – query-by-example paradigm (17/26)
• Query-by-example searching
  • The most fundamental operation – finding the k (k ∈ ℕ) database motions that are the most similar to a query motion
• Main challenges:
  • Effective similarity model to compare the query and database motions
  • Efficient retrieval algorithm to provide the k most similar motions
[Figure: query motion, "What are the 3 most similar motions?", against a database of motions]

Searching – query-by-example paradigm (18/26)
• Query-by-example searching
  • Transforming complex motions to fixed-size feature vectors (~effectiveness), e.g., CNN-based or RNN-based features
  • Indexing feature vectors (~efficiency), e.g., FAISS or PPP-codes
[Figure: deep feature extraction → fixed-size features (e.g., <…, 0.5, 1.1, 9.6, …>) → indexing database features]

Searching – text-to-motion paradigm (19/26)
• Text-to-motion searching:
  • Idea – replace the motion example query by a text query
  • Text query specified by a natural-language description

Searching – text-to-motion paradigm (20/26)
• Learning a common text–motion space
  • Two modalities embedded into a common space: 3D skeleton sequences (motions) and texts
  • The i-th training motion–text pair (Mi, Ci) yields a motion feature vector mi (motion encoder, e.g., RNN-based) and a motion-description feature vector ci (text encoder, e.g., BERT or CLIP)
  • Training over batches of pairs, e.g., B = 128: (m1, c1), (m2, c2), …, (m128, c128)
  • Example description: "A person is walking and making a handstand"

Subsequence searching (21/26)
• Query-by-example subsequence searching
  • Inspecting the long database motions to find their k subsequences that are the most similar to a query motion
• Additional challenge compared to the search task:
  • Efficient retrieval algorithm to localize candidate subsequences
[Figure: query motion, "What are the 3 most similar sub-motions?", against a database of long motions]
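Slide 18 names FAISS as one option for indexing the fixed-size motion features; the following is a minimal sketch of that indexing step, assuming the CNN/RNN features have already been extracted (the dimensionality 512 and the exact L2 index are illustrative choices, not taken from the slides). The same mechanism can also hold the per-segment features introduced on the next slide.

```python
import numpy as np
import faiss  # https://github.com/facebookresearch/faiss

d = 512                                                      # feature dimensionality (illustrative)
db_features = np.random.rand(10_000, d).astype('float32')    # stand-in for extracted motion features
query = np.random.rand(1, d).astype('float32')               # feature of the query motion

index = faiss.IndexFlatL2(d)                # exact k-NN under the Euclidean distance
index.add(db_features)                      # index the database features
distances, ids = index.search(query, 3)     # the 3 most query-similar database motions
print(ids[0], distances[0])
```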
Subsequence searching (22/26)
• Possible solution:
  • Partitioning long database motions into many overlapping segments of different sizes
  • Indexing segments of the same size within a single index structure
  • Searching in a "suitable" index for the most query-similar segments
[Figure: deep feature extraction of segments #1, #2, #3, … → indexing segment features] (DEMO)

Action detection (23/26)
• Detecting actions (events) in skeleton-data streams
  • Processing a motion stream and detecting its subsequences that correspond to the provided action classes
• Main challenges:
  • Learning effective representations of individual action classes
  • Identifying beginnings and endings of to-be-detected actions
[Figure: stream (past → future) with a detected "Kick"; database of action classes: Punch, Handstand, Kick; "What kinds of motions are happening?"]

Action detection (24/26)
• Detecting actions in streams – segment-based principle
  • Applying the "subsequence-search" segmentation
  • Determining similarity between segments and action classes
  • Merging segments
[Figure: stream segments #1, #2, #3, … compared against the database of action classes (Punch, Handstand, Kick); detected "Kick"]

Action detection (25/26)
• Detecting actions in streams – frame-based principle
  • An adopted LSTM-based recurrent neural network estimates a probability for each class and each frame
[Figure: per-frame probabilities of the classes Kick, Punch, and Handstand over the stream (past → future), ranging from "not likely" to "very likely"]

Future research (26/26)
• Paradigm shift
• Possible topics:
  • Indexing mechanisms for very large skeleton-data collections
  • Explainability of similarity models (e.g., of LSTM-based models)
  • New retrieval models (e.g., regex-based or text-to-motion models)
  • Analyzing interactions of multiple persons (e.g., detection of groups of people)
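Returning to the segment-based processing behind subsequence search and action detection (slides 22 and 24): a minimal sketch of partitioning a long motion into overlapping segments of several sizes. The concrete segment lengths and the 75% overlap are illustrative assumptions; the slides only state that the segments overlap and come in different sizes.

```python
def overlapping_segments(n_frames, sizes=(60, 120, 240), overlap=0.75):
    """Yield (size, start, end) frame windows over a motion of n_frames frames.

    Segments of the same size would then be feature-extracted and stored
    in a single index structure, one index per segment size."""
    for size in sizes:
        if size > n_frames:
            continue
        step = max(1, int(size * (1.0 - overlap)))   # shift between consecutive segments
        for start in range(0, n_frames - size + 1, step):
            yield size, start, start + size

# Example: list the 120-frame segments of a 600-frame motion.
segments_120 = [(s, e) for size, s, e in overlapping_segments(600) if size == 120]
```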