Sedmidubsky & Zezula
DISA
2019
FIMU
Jan Sedmidubsky Pavel Zezula
xsedmid@fi.muni.cz zezula@fi.muni.cz
Learning Features for
3D Human-Skeleton Sequences
Laboratory of Data Intensive
Systems and Applications
disa.fi.muni.cz
Learning Features for 3D Human-Skeleton Sequences March 5, 2019
Faculty of Informatics, Masaryk
University, Czech Republic
fi.muni.cz
Outline
1) Motion Data: Representation, Applications, Operations
2) Similarity of Motion Sequences
3) Learning Motion Features for Similarity Comparison
– CNN Features
– LSTM Features
4) Applying Learned Features for kNN Classification
5) Enhancing Feature Learning by Data Augmentation
– Cropping/Extending/Shifting the Motion Content
– Adding Noise to 3D Joint Coordinates
1 Motion Data
3D Skeleton Sequences ~ Motion Capture Data ~
MoCap Data ~ Motion Data
• Continuous spatio-temporal characteristics of a human
motion simplified into a discrete sequence of 3D skeletons
[Figure: synchronized cameras]
3D skeleton sequences
• Skeleton pose:
– Skeleton configuration at a given moment in time
– 3D positions of body landmarks ~ joints
• Different views on motion data:
– A sequence of 3D skeleton poses
– A set of 3D trajectories of joints
[Figure: pose captured at a given moment in time]
1 Applications
Applications
• Many application domains in which motion data have great
potential to be utilized and automatically processed
– Computer animation – virtual and augmented reality
– Medicine – rehabilitation, detection of movement disorders
– Sports – assessment of performance, digital referees
– Smart-homes – detection of falls of elderly people
– Military – simulation of conflict resolving situations
1 Data – Types of Motion Sequences
Motion data types
• Short motions:
– Semantically-indivisible motions ~ ACTIONS
– Length – typically on the order of seconds
– Database – usually a large number of actions
• Long motions:
– Semantically-divisible motions ~ sequences of actions
– Length – on the order of minutes, hours, days, or even unlimited
– Database – typically a single long motion processed either as a
whole or as a stream
[Figure: short motions, e.g., gait cycle (0.6 s) and cartwheel (2.1 s); long motion, e.g., figure skating performance (3 mins)]
1 Motion-Analysis Operations
• Classification ("What is it?") – recognizing the class of a short semantically-indivisible motion, e.g., Pirouette (95%)
• Subsequence search ("Where is it?") – searching a long semantically-divisible motion, e.g., a figure skating performance (3 mins), for occurrences of a short motion, e.g., Rittberger jump (0.4 s) or Pirouette (1.1 s)
• Semantic segmentation ("What is inside?") – identifying the short motions contained in a long motion, e.g., Pirouette (97%), Rittberger (92%)
2 Similarity of Motion Sequences
Similarity of actions (short motions)
• Determining similarity is needed everywhere, e.g., for
classification, searching, semantic segmentation, synthesis
– Similarity measure = features + distance function
• Feature extraction process – each motion M is transformed into a feature vector F(M)
– Example feature vectors of motions M1 and M2: <0, 0, 5.2, 8.1, 0, 2.3, -1.1, 0, …> and <0, 0.1, 6.2, 7.1, 0, 2.9, -2.1, 0, …>
• How similar are the motions?
– $dist(F(M_1), F(M_2)) = 8.56$
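The idea "similarity measure = features + distance function" can be illustrated with a minimal sketch. It uses only the eight vector components visible on the slide (the remaining components are elided), so the printed values are not expected to match the 8.56 reported for the full vectors. The function names are illustrative, not taken from the authors' code.

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two equally long feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimensionality")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Manhattan (L1) distance between two equally long feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimensionality")
    return sum(abs(x - y) for x, y in zip(a, b))

# Assumption: only the visible prefix of each example vector is used.
f_m1 = [0, 0, 5.2, 8.1, 0, 2.3, -1.1, 0]
f_m2 = [0, 0.1, 6.2, 7.1, 0, 2.9, -2.1, 0]

print(round(euclidean(f_m1, f_m2), 3))
print(round(manhattan(f_m1, f_m2), 3))
```

A smaller distance means more similar motions; which distance function fits best depends on how the features were learned.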
2 Challenges of Similarity Measures
Objective
• To propose an effective and efficient similarity measure, i.e.,
features + distance function
Problems
• Similarity is application-dependent (e.g., recognizing daily
actions vs. recognizing people based on their style of walking)
• Subjects have different bodies (e.g., child vs. adult)
• Spatial and temporal deformations – the same action (e.g.,
kick) can be performed with different:
– Styles (e.g., frontal kick vs. side kick) and
– Speeds (e.g., faster vs. slower)
2 Data Normalization
Preprocessing step – data normalization
• Optional step depending on a target application
• Types – skeleton size and joint position and orientation
• Normalizing each pose independently vs. conditionally
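A minimal sketch of position and size normalization follows. It rests on several assumptions that the slide does not state: a pose is a list of (x, y, z) joint tuples, joint 0 is the root, "independently" means each pose is centered by its own root, and "conditionally" means all poses are centered by the root of the first pose. Orientation normalization is omitted.

```python
import math

def normalize_position(poses, root=0, independent=True):
    """Translate poses so that the root joint lies at the origin.

    independent=True  -> each pose is centered by its own root joint
    independent=False -> all poses are centered by the root of the first pose,
                         so the global movement of the subject is preserved
    """
    first_root = poses[0][root]
    normalized = []
    for pose in poses:
        rx, ry, rz = pose[root] if independent else first_root
        normalized.append([(x - rx, y - ry, z - rz) for x, y, z in pose])
    return normalized

def normalize_size(poses, joint_a, joint_b):
    """Scale all poses so that the bone joint_a-joint_b has unit length in the first pose."""
    bone = math.dist(poses[0][joint_a], poses[0][joint_b])
    if bone == 0:
        raise ValueError("reference bone has zero length")
    return [[(x / bone, y / bone, z / bone) for x, y, z in pose] for pose in poses]

# Toy sequence: 2 poses with 3 joints each; the subject moves along the x axis.
poses = [
    [(1.0, 1.0, 0.0), (1.0, 3.0, 0.0), (2.0, 1.0, 0.0)],
    [(2.0, 1.0, 0.0), (2.0, 3.0, 0.0), (3.0, 1.0, 0.0)],
]
independent = normalize_position(poses, independent=True)
conditional = normalize_position(poses, independent=False)
unit_sized = normalize_size(independent, joint_a=0, joint_b=1)
```

Independent centering removes the global trajectory, while centering by the first pose keeps it; which one is preferable depends on the target application.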
3 Feature Extraction
Features
• Hand-crafted features – manual feature engineering
– Low descriptive power – outperformed by ML approaches
– E.g., time series of joint angle rotations compared by DTW
• Machine-learned features – learning features automatically
– Large amount of training data needed
– E.g., CNN, RNN, LSTM features:
• 16–256D float vectors compared by the Euclidean distance
[Coskun et al.: Human Motion Analysis with Deep Metric Learning. ECCV, 2018]
• ?D float vectors compared by the Euclidean distance
[Aristidou et al.: Deep Motifs and Motion Signatures, ACM Trans. Graph., 2018]
• 4,096D float vectors compared by the Euclidean distance
[Sedmidubsky et al.: Effective and efficient similarity searching in motion capture data. MTAP, 2018]
• 160D bit vectors compared by the Hamming distance
[Wang et al.: Deep signatures for indexing and retrieval in large motion databases. Motion in Games, 2015]
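As an illustration of the hand-crafted option mentioned above (time series compared by DTW), here is a minimal sketch of the textbook dynamic time warping recurrence on 1D series. It assumes one series per joint angle and an absolute-difference local cost; it is not the implementation of any of the cited works.

```python
import math

def dtw_distance(series_a, series_b):
    """DTW distance between two 1D time series using |a - b| as the local cost."""
    n, m = len(series_a), len(series_b)
    # cost[i][j] = minimal accumulated cost of aligning the first i samples
    # of series_a with the first j samples of series_b
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(series_a[i - 1] - series_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # series_b sample repeated
                cost[i][j - 1],      # series_a sample repeated
                cost[i - 1][j - 1],  # both series advance
            )
    return cost[n][m]

# The same joint-angle trajectory performed at half speed and at full speed.
slow = [0, 0, 10, 10, 20, 20, 30, 30]
fast = [0, 10, 20, 30]
print(dtw_distance(slow, fast))  # 0.0, the speed difference is warped away
```

The quadratic cost of this alignment is one reason why fixed-size learned features compared by a cheap vector distance are attractive for large collections.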
3 Motion-Image Similarity Measure
Motion-image similarity measure (CNN features)
[Sedmidubsky et al.: Effective and efficient similarity searching in motion capture data. Multimedia Tools and Apps, 2018]
• Deep 4,096D features compared by the Euclidean distance
• Suitable for short motions on the order of seconds (~ actions)
3 Recurrent Neural Networks
Recurrent Neural Networks (RNN)
• Output contents are influenced by the history of inputs
• Long Short-Term Memory (LSTM) network:
– A special kind of RNN, capable of learning long-term dependencies
– It learns when data should be remembered and when they should
be thrown away
3 LSTM-based Similarity Measure
LSTM-based similarity measure (LSTM features)
• Number of states/cells corresponds to the number of poses
• The last state h_{t+1} can be used as a feature
• Size of each state h_i is a user-defined parameter
– Suitable state size of 512 / 1,024 / 2,048 dimensions
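The following sketch shows how the last hidden state of an LSTM can serve as a fixed-size feature of a pose sequence. It uses the standard LSTM gate equations with random, untrained weights and tiny dimensions; the helper names, the weight initialization, and the flattened pose format are assumptions for illustration and do not reproduce the authors' trained network.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_lstm_params(input_size, state_size, seed=0):
    """Random (untrained) weights for the input, forget, output and candidate gates."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        gate: (matrix(state_size, input_size + state_size), [0.0] * state_size)
        for gate in "ifog"
    }

def lstm_step(params, x, h, c):
    """One LSTM cell step: consumes pose x, returns the new hidden and cell state."""
    z = x + h  # concatenation of the current input and the previous hidden state

    def affine(gate):
        weights, bias = params[gate]
        return [sum(w * v for w, v in zip(row, z)) + b for row, b in zip(weights, bias)]

    i = [sigmoid(v) for v in affine("i")]    # input gate: what to remember
    f = [sigmoid(v) for v in affine("f")]    # forget gate: what to throw away
    o = [sigmoid(v) for v in affine("o")]    # output gate
    g = [math.tanh(v) for v in affine("g")]  # candidate cell content
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new

def extract_feature(params, poses, state_size):
    """Run one cell step per pose and return the last hidden state as the feature."""
    h = [0.0] * state_size
    c = [0.0] * state_size
    for pose in poses:
        h, c = lstm_step(params, pose, h, c)
    return h

# Toy input: 5 poses, each flattened to 6 values (2 joints x 3 coordinates).
params = make_lstm_params(input_size=6, state_size=4)
poses = [[0.1 * t] * 6 for t in range(5)]
feature = extract_feature(params, poses, state_size=4)
print(len(feature))  # 4
```

In a real setting the state size would be one of the values listed above (e.g., 1,024) and the weights would come from training on labelled actions.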
[Figure: the last LSTM state is passed through a projection + softmax layer whose output size equals the number of classes; the result is the classification, e.g., class probabilities 0.1, 0.0, …, 0.7, 0.2]
3 Training CNN/LSTM Models
Training CNN/LSTM models for
classification and/or feature extraction
• Training a neural network model (CNN/LSTM)
– Labelled training and validation data ~ actions categorized in classes
– Training in epochs (usually hundreds of epochs)
– Resulting model taken from the epoch achieving the highest accuracy on
validation data
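The model-selection rule above (keep the model from the epoch with the highest validation accuracy) can be sketched independently of any deep-learning framework. The callables `train_one_epoch` and `evaluate` are hypothetical placeholders for framework-specific code; the toy stand-ins below exist only to make the sketch runnable.

```python
import copy

def train_with_model_selection(model, train_one_epoch, evaluate, num_epochs):
    """Train for num_epochs and return the snapshot with the highest validation accuracy."""
    best_accuracy = -1.0
    best_epoch = None
    best_model = None
    for epoch in range(1, num_epochs + 1):
        train_one_epoch(model)      # update the model on the training data
        accuracy = evaluate(model)  # accuracy on the validation data
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_epoch = epoch
            best_model = copy.deepcopy(model)  # snapshot, later epochs cannot change it
    return best_model, best_epoch, best_accuracy

# Toy stand-ins (hypothetical): the "model" only counts completed epochs and
# the validation accuracy is read from a fixed list.
toy_accuracies = [0.50, 0.80, 0.70]

def toy_train_one_epoch(model):
    model["epochs_done"] += 1

def toy_evaluate(model):
    return toy_accuracies[model["epochs_done"] - 1]

best_model, best_epoch, best_accuracy = train_with_model_selection(
    {"epochs_done": 0}, toy_train_one_epoch, toy_evaluate, num_epochs=3
)
print(best_epoch, best_accuracy)  # 2 0.8
```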
[Figure: the classified dataset is split into training data and validation data; the neural network model is trained (fine-tuned) on the training data and validated on the validation data]
3 Classification Dataset
HDM05 dataset (120 Hz sampling, 31 body joints)
• Ground truth – 2,328/2,345 actions in 122/130 classes
– Shortest and longest actions: 13 frames (0.1 s) and 900 frames (7.5 s)
– Action classes corresponding to daily/exercising activities, e.g.:
• “Clap with hands 5 times”, “Walk two steps, starting with left leg”, “Turn left”,
“Frontal kick by left leg two times”, “Cartwheel, starting with left hand”
Learning & validation
• n-fold cross validation procedure – a standard statistical
method to estimate the skill of machine learning models
• We adopt 2-fold cross validation – dataset split into 2 folds,
either randomly or in a balanced way:
1) 1st fold used for training, 2nd fold used for validation
2) 2nd fold used for training, 1st fold used for validation
3) The results of both runs are averaged
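The 2-fold procedure can be sketched as follows. The slide does not define "balanced", so the sketch assumes it means that the actions of every class are spread as evenly as possible over both folds; `train_and_validate` is a hypothetical callable that trains on one fold and returns the accuracy on the other.

```python
import random
from collections import defaultdict

def balanced_two_fold_split(labels, seed=0):
    """Split action indices into 2 folds so that each class is spread evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = ([], [])
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Alternate between the folds; for an odd class size the first fold
        # receives the extra action.
        for position, index in enumerate(indices):
            folds[position % 2].append(index)
    return folds

def two_fold_accuracy(fold_a, fold_b, train_and_validate):
    """Run both train/validation assignments and average the two accuracies."""
    first = train_and_validate(fold_a, fold_b)   # train on fold A, validate on fold B
    second = train_and_validate(fold_b, fold_a)  # train on fold B, validate on fold A
    return (first + second) / 2

labels = ["kick"] * 4 + ["jump"] * 3 + ["turn"] * 2
fold_a, fold_b = balanced_two_fold_split(labels)
```

A purely random split can leave a small class almost entirely in one fold, which is the situation the balanced strategy is meant to avoid.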
3 Classification Dataset – Training
Training CNN/LSTM models for feature extraction
• Training times:
– Training time depends linearly on the number of training actions
– CNN (through motion images):
• 1,164 actions ~ 2 hours
– LSTM (influenced by parameters, e.g., feature size, action length, fps):
• 1,164 actions ~ 1 hour
• 10,476 actions ~ 9 hours
• 20,952 actions ~ 18 hours
• Learned features:
– CNN: 4,096D features + Euclidean distance
– LSTM: 1,024D features + Manhattan distance
4 Applying Learned Features for Action Recognition
[Figure: an unlabeled action ("What is it?") is compared by similarity search against a reference collection of actions, e.g., Rittberger jump (0.4 s) and Pirouette (1.1 s), and classified, e.g., as Pirouette (95%)]
Reference collection
• Categorized features of training actions, obtained using the CNN/LSTM neural network model
Similarity search
• Searching for the k-nearest actions to the unlabeled action, simply using the sequential scan
Classification
• Recognizing the unlabeled action class using the 1NN or weighted-distance kNN classifier
4 1NN Classification
1NN classification
• Searching for the nearest neighbor to the unlabeled action
• Class of the nearest neighbor considered as class of the query
Example
• JUMP class feature vectors: <…, 0.53, 10.8, 4.64, …>, <…, 0.12, 8.60, 1.99, …>
• KICK class feature vectors: <…, 8.93, 10.1, 2.43, …>, <…, 7.42, 7.14, 2.27, …>, <…, 3.93, 6.26, 3.41, …>
• Unlabeled action feature vector: <…, 0.93, 10.1, 2.43, …>
• Nearest neighbors ordered by distance: 1. 8.7 JUMP; 2. 10.9 KICK; 3. 13.2 KICK; 4. 14.3 KICK
• Result: JUMP (100%)
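A minimal sketch of 1NN classification by sequential scan is given below. It uses only the three components of each feature vector that are visible on the slide and assumes the Euclidean distance, so the computed distances differ from the slide's values (8.7, 10.9, …), which were obtained on the full vectors. The function names are illustrative.

```python
import math

def knn_sequential_scan(query, reference, k):
    """Return the k nearest (distance, label) pairs found by scanning all reference features."""
    scored = [(math.dist(query, features), label) for features, label in reference]
    return sorted(scored)[:k]

def classify_1nn(query, reference):
    """Assign the class of the single nearest reference action."""
    _, label = knn_sequential_scan(query, reference, k=1)[0]
    return label

# Assumption: only the visible components of the slide's vectors are used.
reference = [
    ((0.53, 10.8, 4.64), "JUMP"),
    ((0.12, 8.60, 1.99), "JUMP"),
    ((8.93, 10.1, 2.43), "KICK"),
    ((7.42, 7.14, 2.27), "KICK"),
    ((3.93, 6.26, 3.41), "KICK"),
]
query = (0.93, 10.1, 2.43)

print(classify_1nn(query, reference))  # JUMP
```

The sequential scan computes one distance per reference action, so its cost grows linearly with the size of the reference collection.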
4 kNN Classification
Weighted-distance kNN classifier
[Sedmidubsky et al.: Probabilistic Classification of Skeleton Sequences. DEXA, 2018]
• Considering not only the number of votes but also the
similarity of neighbors
– Normalizing the neighbor distance with respect to the k-th neighbor
• Effective when distances of nearest neighbors vary across classes
– Computing class relevance by summing relevance of class neighbors
(1 – normalized distance)
Example:
Neighbor | Original distance | Normalized distance | Relevance | Class
1 | 8.7 | 0.55 | 0.45 | JUMP
2 | 10.9 | 0.69 | 0.31 | KICK
3 | 13.2 | 0.84 | 0.16 | KICK
4 | 14.3 | 0.91 | 0.09 | KICK
Relevance of classes: 0.45 JUMP, 0.56 KICK => JUMP (45%), KICK (55%)
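The example can be recomputed with a short sketch of the weighted-distance vote: each neighbor contributes (1 – normalized distance) to its class. The normalizing distance of 15.8 is an assumption inferred from the example's numbers (8.7 / 0.55 ≈ 15.8); it is not stated on the slide, and the function is an illustration rather than the classifier from the cited paper.

```python
from collections import defaultdict

def weighted_distance_knn(neighbors, normalizing_distance):
    """neighbors: list of (distance, label); returns (class relevance, class shares)."""
    relevance = defaultdict(float)
    for distance, label in neighbors:
        normalized = distance / normalizing_distance
        relevance[label] += max(0.0, 1.0 - normalized)  # closer neighbors count more
    total = sum(relevance.values())
    if total == 0:
        raise ValueError("all neighbors have zero relevance")
    shares = {label: value / total for label, value in relevance.items()}
    return dict(relevance), shares

neighbors = [(8.7, "JUMP"), (10.9, "KICK"), (13.2, "KICK"), (14.3, "KICK")]
# Assumption: 15.8 is inferred from the example, not given in the source.
relevance, shares = weighted_distance_knn(neighbors, normalizing_distance=15.8)
predicted = max(shares, key=shares.get)
print(predicted)  # KICK
```

Because the sketch does not round intermediate values, its class relevances differ slightly from the rounded numbers in the example, but the winning class is the same. Note that plain 1NN would answer JUMP here, while the three weaker KICK neighbors together outweigh the single JUMP neighbor.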
4 CNN/LSTM Recognition Results – Influence of Data Splitting Strategy
• Data: 2,328 actions in 122 classes
– 1,164 training and 1,164 test actions
– Fold-splitting strategies:
• Random
• Balanced
5 Augmentation of Training Data
Augmentation of training data
• Increasing the number of training actions artificially
– Inspired by the image domain – image flips, blurry images
• Proposed action augmentation techniques:
1) Shifting/cropping/extending original actions
2) Adding noise to 3D joint coordinates of original actions
3) Combination of both of the above techniques
5 Shifting/Cropping/Extending Original Actions
Augmentation (1)
• Shift – shifting boundaries of original actions within parent sequences
– E.g., shift of 20% w.r.t. action length
• Crop – cropping original actions
– E.g., crop of 20% w.r.t. action length (10% on left and 10% on right side)
• Extension – extending boundaries of original actions within parent sequences
– E.g., extension of 20% w.r.t. action length (10% on left and 10% on right side)
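The three boundary operations can be sketched as a function that derives new (start, end) frame boundaries of an action inside its parent sequence. The frame-index representation (exclusive end), the rounding, and the clipping at the parent boundaries are assumptions; the default ratios of 10% and 20% yield nine variants per action, including the original.

```python
def augment_boundaries(start, end, parent_length, ratios=(0.1, 0.2)):
    """Return (name, start, end) variants of an action; end is exclusive."""
    length = end - start
    variants = [("original", start, end)]
    for ratio in ratios:
        offset = round(length * ratio)    # full shift w.r.t. action length
        half = round(length * ratio / 2)  # crop/extension is split over both sides
        pct = round(ratio * 100)
        candidates = [
            (f"shift_left_{pct}", start - offset, end - offset),
            (f"shift_right_{pct}", start + offset, end + offset),
            (f"crop_{pct}", start + half, end - half),
            (f"extend_{pct}", start - half, end + half),
        ]
        for name, new_start, new_end in candidates:
            # Assumption: variants reaching outside the parent sequence are clipped,
            # which may shorten them.
            new_start = max(0, new_start)
            new_end = min(parent_length, new_end)
            if new_end > new_start:
                variants.append((name, new_start, new_end))
    return variants

variants = augment_boundaries(start=100, end=200, parent_length=1000)
print(len(variants))  # 9
```

Shifted and extended variants need frames from the parent sequence, whereas cropped variants can be produced from the original action alone.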
5 Shifting/Cropping/Extending Original Actions – TRaug Results
Training data augmentation (1) – shift/crop/extension
• no augmentation – 1,164 actions for learning; 1,164 reference actions
• augPart+TRaug – 5,820 actions for learning; 5,820 reference actions
– Each action in 5 variants: original; 10/20% shift (left + right)
• augAll+TRaug – 10,476 actions for learning; 10,476 reference actions
– Each action in 9 variants: original; 10/20% shift (left + right); 10/20% crop and extension
Best results:
– no augmentation = 91.41%
– augPart+TRaug = 92.14%
– augAll+TRaug = 93.00%
5 Shifting/Cropping/Extending Original Actions – Results
Training data augmentation (1) – shift/crop/extension
• no augmentation – 1,164 actions for learning; 1,164 reference actions
• augPart – 5,820 actions for learning; 1,164 reference actions
– Each action in 5 variants: original; 10/20% shift (left + right)
• augAll – 10,476 actions for learning; 1,164 reference actions
– Each action in 9 variants: original; 10/20% shift (left + right); 10/20% crop and extension
Best results:
– no augmentation = 91.41%
– augPart+TRaug = 92.14%
– augAll+TRaug = 93.00%
– augPart = 92.87%
– augAll = 93.34%
5 Adding Noise to 3D Joint Coordinates
Augmentation (2)
• Random joint coordinate noise – moving each joint to a new 3D position
– MRT – max relative move threshold (e.g., 5 cm)
– Each joint coordinate is moved along each of the x/y/z axes by
a random value from [0, MRT]
[Figure: original elbow position and new elbow position]
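A minimal sketch of the joint-noise augmentation follows. The move magnitude is drawn from [0, MRT] as described above; the uniform distribution and the random direction (sign) of each move are assumptions, as is the pose format (a list of (x, y, z) tuples in the same unit as MRT).

```python
import random

def add_joint_noise(poses, mrt, seed=None):
    """Return a noisy copy of the poses; every coordinate moves by at most mrt."""
    rng = random.Random(seed)

    def move(value):
        # Magnitude from [0, MRT]; the uniform distribution and the random sign
        # are assumptions, the source only gives the interval.
        return value + rng.choice((-1.0, 1.0)) * rng.uniform(0.0, mrt)

    return [[tuple(move(c) for c in joint) for joint in pose] for pose in poses]

# Toy action: 1 pose with 2 joints, coordinates in cm.
original = [[(0.0, 0.0, 0.0), (10.0, 20.0, 30.0)]]
noisy = add_joint_noise(original, mrt=5.0, seed=1)
```

Calling the function with several MRT values (e.g., the thresholds between roughly 0.6 cm and 5 cm) produces several noisy variants of each training action.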
5 Adding Noise to 3D Joint Coordinates – Results
Training data augmentation (2) – noise in joint coordinates
• no augmentation – 1,164 actions for learning; 1,164 reference actions
• aug0-rjc – 5,820 actions for learning; 1,164 reference actions
– Each action in 5 variants: original; MRT of ~ 0.6, 1.3, 2.5 and 5 cm
• aug0-rjc+TRaug – 5,820 actions for learning; 5,820 reference actions
– Each action in 5 variants: original; MRT of ~ 0.6, 1.3, 2.5 and 5 cm
Best results:
– no augmentation = 91.41%
– aug0-rjc+TRaug = 92.83%
– aug0-rjc = 92.70%
5 Combination of Data Augmentation
Training data augmentation (3) – combination
• no augmentation – 1,164 actions for learning; 1,164 reference actions
• augAll+rjc+TRaug – 10,476 actions for learning; 10,476 reference actions
– Each action in 9 variants: original; 10/20% shift/crop/extension with MRT of 2.5 cm
• augAll+rjc – 10,476 actions for learning; 1,164 reference actions
– Each action in 9 variants: original; 10/20% shift/crop/extension with MRT of 2.5 cm
Best results:
– no augmentation = 91.41%
– augAll+rjc+TRaug = 93.94%
– augAll+rjc = 94.33%
5 Comparison to the State-of-the-Art
State-of-the-art comparison
• HDM05 dataset – 2,328/2,345 samples in 122/130 classes
• 2-fold cross validation (50% of training data)
Results – accuracy (%)
Method (related approaches first) | HDM-122 | HDM-130
Huang et al. (2016) | N/A | 75.78
Laraba et al. (2017) | N/A | 83.33
Li et al. (2018) | N/A | 86.17
1NN on 4kMI (2017) | 87.24 | 86.79
1NN on 4kMIE (2017) | 87.84 | 87.38
Confusion-based 15NN_TCS on 4kMIE (2018) | 89.09 | 88.78
LSTM features on balanced splits (2019) | 91.41 |
LSTM features on balanced splits + augmented tr. data (2019) | 94.33 |
Conclusions
Observations
• LSTM features outperform CNN features in both
effectiveness and efficiency
– LSTM features can be parametrized in many ways (e.g., size of
feature, size of embedding)
• Splitting training data in a balanced way increases the
recognition accuracy a lot
– 88.66% => 91.41% (decrease in classification error of 24%)
• Augmenting less-populated classes of training data
increases the recognition accuracy a lot
– 91.41% => 94.33% (decrease in error of 34% vs. no augmentation)
– 89.09% => 94.33% (decrease in error of 48% vs. state-of-the-art result)
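The relative error decreases quoted above follow from the accuracies by simple arithmetic, as this small sketch shows (the function name is illustrative):

```python
def relative_error_reduction(accuracy_before, accuracy_after):
    """Relative decrease of the classification error (in %), given accuracies in %."""
    error_before = 100.0 - accuracy_before
    error_after = 100.0 - accuracy_after
    return 100.0 * (error_before - error_after) / error_before

print(round(relative_error_reduction(88.66, 91.41)))  # 24
print(round(relative_error_reduction(91.41, 94.33)))  # 34
print(round(relative_error_reduction(89.09, 94.33)))  # 48
```

For example, going from 88.66% to 91.41% accuracy lowers the error from 11.34% to 8.59%, i.e., by 2.75 percentage points, which is about 24% of the original error.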