International Journal of Research In Science & Engineering
Volume: 1 Special Issue: 2
e-ISSN: 2394-8299
p-ISSN: 2394-8280
CLASSIFICATION OF HUMAN ACTIONS USING NON-LINEAR SVM BY EXTRACTING SPATIO-TEMPORAL HOG FEATURES WITH CUSTOM DATASET
Yashaswini Mahesh1, Dr. M. Shivakumar2, Dr. H. S. Mohana3
1 Dept. of ECE, GSSSIT, Mysuru, toyashaswini@gmail.com
2 Dept. of TCE, GSSSIT, Mysuru, hodte@gsss.edu.in
3 Dept. of IT, Malnad College of Engineering, Hassan, hsm@mcehassan.ac.in
ABSTRACT
Local space-time features capture local events in a video sequence; these features help to extract motion descriptors which capture the motion sequence in the video. This paper proposes that the actions of the human body be recognized by detecting interest points called Spatio-Temporal Interest Points (STIP). A motion descriptor, the Histogram of Oriented Gradients (HOG), is extracted around each interest point in each frame of the video. From the HOG descriptors extracted from the training videos, a dictionary of Bag of Visual Features (BoVF) is created using k-means clustering. Video representations are constructed as normalized histograms over the BoVF for each action video in order to train and test the classifier. A Support Vector Machine (SVM) classifier is used to classify the human body actions. The algorithms are evaluated on the challenging KTH human motion dataset as well as on a custom-built dataset defined for four actions, i.e. bouncing, boxing, jumping, and kicking, in different scenarios. The experimental results report the average class accuracy on the real-world custom dataset and the KTH dataset.
Keywords: Spatio-temporal interest point (STIP), Histogram of Oriented Gradients (HOG), Bag of Visual Features (BoVF), Support Vector Machine (SVM).
1. INTRODUCTION
The recognition of human actions from videos has various important and interesting applications, and different approaches have been proposed for this task. The proposed work develops a Human Activity Recognition (HAR) system to detect and recognize ongoing activities of human beings automatically from an unknown video sequence. The HAR system leads to the development of real-time applications. Activities of humans have been classified based on their complexity: gestures, actions, interactions, and group activities [1]. There are several real-world applications of a HAR system based on the information extracted from activity videos, such as HAR systems for law enforcement that detect activities in public places [1]. Other real-time applications include patient monitoring in hospitals, child care, and monitoring of elderly persons. The challenging problems are distinctive scenarios such as a non-stationary camera, moving background, size variations, different clothing, appearance, velocity, and so on. A robust recognition system should overcome the above problems in order to be a successful HAR system.
The presented work performs detection and recognition of human body actions. This is achieved by extracting Space-Time Interest Points (STIP). The number of feature points extracted varies for every frame; these features characterize the motion events locally in the video. HOG descriptors are extracted around each STIP and a BoVF is created from the extracted descriptors. A normalized histogram (feature vector) is constructed with reference to the BoVF dictionary for each entire action video sequence. These distinctive attribute vectors are used to train and test the SVM classifier. To evaluate the system, the KTH database is used, and the evaluation results for recognizing four types of human actions, i.e. hand waving, boxing, walking, and running in different scenarios, are tabulated using a confusion matrix; the accuracy is also examined on the custom-built dataset.
2. LITERATURE SURVEY
Human Activity Recognition (HAR) methodologies are classified into two categories, viz., single-layered and hierarchical. The single-layered approach is best suited for low-level activities such as gestures and actions; here, features are extracted directly from the image sequence, which is suitable for periodic repetitions of an action. The hierarchical approach is suited to representing high-level activities such as interactions and group activities, which are described in terms of other, simpler activities called sub-events.
Since the idea of the proposed work is recognition of human actions, the review is confined to the first approach. Fig-1 shows the tree-structure taxonomy of the single-layered approach, which is categorized into two classes, viz., space-time approaches and sequential approaches.
Fig-1: Tree-structure taxonomy of the single-layered approach [1]: space-time approaches (space-time volume, space-time trajectories, space-time features) and sequential approaches (exemplar-based, state-based).
Several methods have been proposed based on these approaches. In Davis and Bobick [3], patches are extracted from a video volume and compared with the video volumes of a stored database. In Ke et al. [4], three-dimensional segments related to the motion of human body parts are extracted automatically from patches of the video sequence. Rodriguez et al. [5] used generated filters to capture the volume characteristics and compare them with the characteristics of the posed video. Sheikh et al. [6] and Yilmaz and Shah [7] showed a method to extract trajectories from the video volume that represent the action in a four-dimensional space; an affine projection is then used so that the action is represented by normalized three-dimensional trajectories, and the extracted features are view-invariant. Rao and Shah's [8] methodology shows how to extract meaningful curvature patterns from the trajectories. Chomat et al. [9], Zelnik-Manor et al. [10], and Shechtman et al. [11] showed that the overall motion can be characterized by concatenating the frames temporally. The approaches of Laptev [12], Dollar [13], Niebles et al. [14], Yilmaz and Shah [15], and Ryoo [16] extract sparse STIPs from a three-dimensional video signal. Schuldt, Caputo, and Laptev [17] and G. Zhu, C. Xu, et al. [18] presented papers on recognizing human actions from extracted features using SVMs, in which local features are extracted to capture events locally in the video sequence. Seyyed Majid Valiollahzadeh et al. [19] proposed an algorithm that merges AdaBoost with Support Vector Machines (SVM) as weak component classifiers for the face detection task. The proposed work uses the space-time local feature approach under the space-time layered approach to extract features. The space-time local feature approach has several advantages compared to other approaches: background subtraction and detailed information are not required, and the extracted information is invariant to object size, direction, and movement. Thus, this method is best suited for recognizing repetitive actions such as 'jumping' and 'clapping', since these actions generate repetitive feature patterns.
3. DESIGN METHODOLOGY
Fig-2 and Fig-3 illustrate the outline of the human activity recognition system, and its blocks are briefly described below. In order to achieve robust recognition, the system should be invariant to scale, velocity, non-homogeneous backgrounds, and noise. There are two sets of data: one for training and another for testing. The training datasets are needed to train the entire system with the relevant details of the actions present in the video sequences, and the testing dataset is used for performance evaluation of the system. The details of each module of the proposed design are as follows.
Fig-2: Feature extraction model: video sequence → split video into frames → convert frames to images → pre-processing → feature extraction.
3.1 Feature Extraction Model: One has to analyze the video frame by frame in order to obtain the details of object appearance, shape, and motion characteristics, since the input to the proposed system is a video sequence.
Thus, the video sequence is converted into frames and in turn into images. The input image needs to be pre-processed to remove unwanted signals and preserve the required details. In general, the input signal is affected by unwanted signals such as noise in the capturing device, lighting variation, and changes in background clutter like trees, due to which the required details cannot be extracted satisfactorily. This process is carried out before feature extraction; a simple Gaussian filter is used for pre-processing. The STIPs are then extracted from the pre-processed image, and HOG descriptors are computed for each STIP to create the BoVF.
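To make the pre-processing step concrete, a minimal sketch in Python with OpenCV is given below. The paper's implementation is in Microsoft Visual Studio with the OpenCV libraries, so this is only an illustrative equivalent; the kernel size, sigma, and file name are assumed values, not the ones used in the paper.

```python
import cv2

def preprocess_video(path, ksize=(5, 5), sigma=1.0):
    """Split a video into frames, convert them to grayscale images and Gaussian-smooth them."""
    cap = cv2.VideoCapture(path)          # open the action video
    frames = []
    while True:
        ok, frame = cap.read()            # read the next frame
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # frame -> grayscale image
        smooth = cv2.GaussianBlur(gray, ksize, sigma)    # simple Gaussian filter
        frames.append(smooth)
    cap.release()
    return frames

# Example with a hypothetical file name:
# frames = preprocess_video("boxing_s1.avi")
```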
3.2 Classification Model: A non-linear SVM is used as the classifier. The feature vectors constructed from the BoVF are used to train and test the classifier.

Fig-3: Classification model: the extracted features and the features-of-actions database are input to supervised learning; the trained action classifier outputs an action label.
4. FEATURE EXTRACTION
The feature extraction methods used in the proposed methodology include the extraction of spatio-temporal features, Histogram of Oriented Gradients (HOG) descriptors, and the creation of the Bag of Visual Features (BoVF).
4.1 Spatio-temporal features: To model a three-dimensional input video sequence $f : \mathbb{R}^2 \times \mathbb{R} \rightarrow \mathbb{R}$, its linear scale-space representation $L : \mathbb{R}^2 \times \mathbb{R} \times \mathbb{R}_+^2 \rightarrow \mathbb{R}$ is constructed by the convolution of $f$ with an anisotropic 3-D Gaussian kernel with spatial variance $\sigma_l^2$ and temporal variance $\tau_l^2$ [12]:

$$L(\cdot;\, \sigma_l^2, \tau_l^2) = G(\cdot;\, \sigma_l^2, \tau_l^2) * f(\cdot) \qquad (1)$$

The anisotropic 3-D Gaussian kernel is defined as

$$G(x, y, t;\, \sigma_l^2, \tau_l^2) = \frac{1}{\sqrt{(2\pi)^3\, \sigma_l^4\, \tau_l^2}} \exp\!\left( -\frac{x^2 + y^2}{2\sigma_l^2} - \frac{t^2}{2\tau_l^2} \right) \qquad (2)$$
It is required to find the significant changes in the horizontal, vertical, and temporal directions of the image sequence by detecting Harris interest points. These points are found from a 3×3 second-moment matrix consisting of first-order spatial and temporal derivatives, averaged by convolution with a 3-D Gaussian weighting function $G(\cdot;\, \sigma_i^2, \tau_i^2)$:

$$\mu = G(\cdot;\, \sigma_i^2, \tau_i^2) * \begin{pmatrix} L_x^2 & L_x L_y & L_x L_t \\ L_x L_y & L_y^2 & L_y L_t \\ L_x L_t & L_y L_t & L_t^2 \end{pmatrix} \qquad (3)$$

The integration scales $\sigma_i^2$ and $\tau_i^2$ are related to the local scales $\sigma_l^2$ and $\tau_l^2$ as $\sigma_i^2 = s\,\sigma_l^2$ and $\tau_i^2 = s\,\tau_l^2$. The first-order derivatives are defined as

$$L_x(\cdot;\, \sigma_l^2, \tau_l^2) = \partial_x (G * f) \qquad (4)$$
$$L_y(\cdot;\, \sigma_l^2, \tau_l^2) = \partial_y (G * f) \qquad (5)$$
$$L_t(\cdot;\, \sigma_l^2, \tau_l^2) = \partial_t (G * f) \qquad (6)$$
For a given video input $f$, interest points are detected where the second-moment matrix $\mu$ has significant eigenvalues $\lambda_1, \lambda_2, \lambda_3$. To find such regions, the Harris corner detector is extended to three dimensions, i.e. into the temporal domain. The extended Harris corner function combines the determinant and the trace of $\mu$:

$$H = \det(\mu) - k\, \mathrm{trace}^3(\mu) = \lambda_1 \lambda_2 \lambda_3 - k (\lambda_1 + \lambda_2 + \lambda_3)^3 \qquad (7)$$

To show that positive local maxima of $H$ correspond to points with high values of $\lambda_1, \lambda_2, \lambda_3$ ($\lambda_1 \le \lambda_2 \le \lambda_3$), the ratios $\alpha = \lambda_2/\lambda_1$ and $\beta = \lambda_3/\lambda_1$ are defined and $H$ is rewritten as

$$H = \lambda_1^3 \left( \alpha\beta - k (1 + \alpha + \beta)^3 \right) \qquad (8)$$

For $H \ge 0$ we require $k \le \alpha\beta / (1 + \alpha + \beta)^3$, and it follows that $k$ attains its maximum possible value $k = 1/27$ when $\alpha = \beta = 1$. Thus, three-dimensional interest points in $f$ can be identified by detecting local positive maxima of $H$ [12]. Fig-4 shows the detected STIPs for the walking and boxing action sequences.
Fig-4: Local space-time features detected for (i) walking and (ii) boxing.
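A minimal NumPy/SciPy sketch of how Eqs. (1)-(8) translate into a detector is given below; it is not the authors' implementation, and the scale parameters sigma_l, tau_l, s and the constant k are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def stip_harris3d(video, sigma_l=2.0, tau_l=1.5, s=2.0, k=0.005):
    """Detect spatio-temporal interest points with the extended Harris function (Eq. 7).

    video : float array of shape (T, H, W); parameter values are illustrative guesses.
    """
    # Scale-space representation L = G(sigma_l, tau_l) * f   (Eqs. 1-2)
    L = gaussian_filter(video.astype(np.float64), sigma=(tau_l, sigma_l, sigma_l))
    Lt, Ly, Lx = np.gradient(L)                      # first-order derivatives (Eqs. 4-6)

    # Second-moment matrix entries, averaged at the integration scales (Eq. 3),
    # where sigma_i^2 = s * sigma_l^2 and tau_i^2 = s * tau_l^2
    si, ti = np.sqrt(s) * sigma_l, np.sqrt(s) * tau_l
    smooth = lambda a: gaussian_filter(a, sigma=(ti, si, si))
    Axx, Ayy, Att = smooth(Lx * Lx), smooth(Ly * Ly), smooth(Lt * Lt)
    Axy, Axt, Ayt = smooth(Lx * Ly), smooth(Lx * Lt), smooth(Ly * Lt)

    # H = det(mu) - k * trace(mu)^3   (Eq. 7)
    det = (Axx * (Ayy * Att - Ayt * Ayt)
           - Axy * (Axy * Att - Ayt * Axt)
           + Axt * (Axy * Ayt - Ayy * Axt))
    trace = Axx + Ayy + Att
    H = det - k * trace ** 3

    # Keep positive local maxima of H as interest points
    local_max = (H == maximum_filter(H, size=5)) & (H > 0)
    return np.argwhere(local_max)                    # (t, y, x) coordinates

# Hypothetical usage with the pre-processed frames from Section 3.1:
# points = stip_harris3d(np.stack(frames).astype(np.float32))
```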
4.2 Histogram of Oriented Gradients: Local object appearance and shape can be characterized well by the distribution of local intensity gradients or edge directions, even without precise knowledge of the corresponding gradient or edge positions. The region-of-interest image is divided into small cells, each cell accumulating a local one-dimensional histogram of gradient directions within a detection window, and the histogram entries of all cells are concatenated to form a single vector representation. For better robustness to variations in illumination, shadowing, etc., the detection windows are overlapped by one cell size before they are used, and it is also useful to normalize the contrast of the local responses. HOG descriptors are extracted around each STIP for all the video sequences.
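The following sketch, using scikit-image's hog function as a stand-in rather than the paper's implementation, computes a descriptor in a 32×32 window around each detected STIP; with 8×8 cells, 2×2-cell blocks and 9 orientation bins this yields the 1×324 descriptor length reported in Section 6.

```python
import numpy as np
from skimage.feature import hog

def hog_at_points(frames, points, patch=32):
    """Compute a 1x324 HOG descriptor in a 32x32 window around each STIP.

    frames : list of grayscale images; points : (t, y, x) STIP coordinates.
    A 32x32 window with 8x8 cells, 2x2-cell blocks and 9 bins gives 3*3*36 = 324 values.
    """
    half = patch // 2
    descriptors = []
    for t, y, x in points:
        img = frames[t]
        if y < half or x < half or y + half > img.shape[0] or x + half > img.shape[1]:
            continue                                  # skip points too close to the border
        window = img[y - half:y + half, x - half:x + half]
        d = hog(window, orientations=9, pixels_per_cell=(8, 8),
                cells_per_block=(2, 2), block_norm='L2-Hys')
        descriptors.append(d)                         # len(d) == 324
    return np.array(descriptors)
```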
4.3 Bag of Visual Features: The BoVF is created from the HOG descriptors of the training videos using k-means clustering; the centroid of each cluster forms a keyword in the dictionary [21]. This BoVF dictionary is used to generate a feature vector from each action video to train and test the classifier.
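A minimal sketch of the dictionary construction and histogram encoding is shown below, using scikit-learn's KMeans as a stand-in for the paper's k-means step; k = 5000 follows Section 6, while the remaining parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_bovf_dictionary(train_descriptors, k=5000):
    """Cluster all training HOG descriptors; the cluster centres form the BoVF dictionary."""
    km = KMeans(n_clusters=k, n_init=4, random_state=0)
    km.fit(train_descriptors)            # train_descriptors: (N, 324) array
    return km                            # km.cluster_centers_ has shape (k, 324)

def video_histogram(km, video_descriptors):
    """Quantize one video's descriptors against the dictionary and L1-normalize the counts."""
    words = km.predict(video_descriptors)                     # nearest visual word per STIP
    hist = np.bincount(words, minlength=km.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)                        # 1 x k feature vector per video
```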
5. CLASSIFICATION
Support Vector Machine: The SVM is a linear classifier that can be extended to a multi-dimensional (non-linear) setting by the kernel trick. The linear SVM classifier is based on dot products of data point vectors: let $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^{T} \mathbf{x}_j$; the SVM decision function is then

$$f(\mathbf{x}) = \operatorname{sign}\Big( \sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \Big)$$

Every data point is mapped into a higher-dimensional space by a transformation $\Phi : \mathbf{x} \mapsto \Phi(\mathbf{x})$.

Fig-5: Mapping of (a) input space to (b) feature space.
The dot product then becomes $\Phi(\mathbf{x}_i)^{T} \Phi(\mathbf{x}_j)$, and the kernel computes the quantity $K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i)^{T} \Phi(\mathbf{x}_j)$. A kernel function $K$ is a function that corresponds to a dot product in some expanded feature space. The two kernels used are the polynomial kernel and the Radial Basis Function (RBF) kernel. The RBF kernel maps the data into an infinite-dimensional Hilbert space; it is a Gaussian, calculated as

$$K(\mathbf{x}, \mathbf{z}) = e^{-\|\mathbf{x} - \mathbf{z}\|^2 / (2\sigma^2)}$$
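As an illustrative sketch (scikit-learn's SVC rather than the paper's OpenCV implementation), a non-linear SVM with the RBF kernel can be trained on the BoVF histograms as follows; the C and gamma values are assumptions, not the parameters used in the paper.

```python
from sklearn.svm import SVC

def train_action_classifier(train_histograms, train_labels, C=10.0, gamma='scale'):
    """Train a non-linear SVM on the BoVF histograms.

    kernel='rbf' selects K(x, z) = exp(-||x - z||^2 / (2*sigma^2)); scikit-learn's
    gamma corresponds to 1/(2*sigma^2). C and gamma here are illustrative defaults.
    """
    clf = SVC(kernel='rbf', C=C, gamma=gamma)
    clf.fit(train_histograms, train_labels)     # one 1 x 5000 histogram per training video
    return clf

# Hypothetical usage:
# predicted = train_action_classifier(X_train, y_train).predict(X_test)
```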
6. EXPERIMENTAL SETUP
The proposed method is evaluated on the KTH dataset and a custom dataset, from which the features are extracted.
KTH Dataset: In this work, four types of human actions are considered, namely boxing, hand waving, walking, and running, performed under four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors. In total, 160 videos (40 videos from each action category) are used for training the classifier from the extracted feature vectors, and 80 videos (20 videos from each category) are used exclusively for testing. All videos were taken over homogeneous backgrounds with a static camera at a 25 fps frame rate.
Custom Dataset: Four types of human actions are considered, namely bouncing, boxing, jumping, and kicking, performed under four different scenarios: indoors (s1), outdoors (s2), different clothes (s3), and different camera angle (s4). In total, 140 videos (35 videos from each action category) were built and used for training the classifier from the extracted feature vectors, and 20 videos (5 videos from each category) were used exclusively for testing. All videos were taken over non-homogeneous backgrounds with a static camera at a 30 fps frame rate. The videos were down-sampled to a spatial resolution of 160×120 pixels.
Here, for the three-dimensional video volumes, a histogram descriptor is computed in the neighbourhood around each STIP based on the motion and appearance characteristics. The space-time volume is of size 32×32×5, and for each 32×32 block a HOG descriptor is found by dividing the block into sub-blocks of 8×8. The generated HOG descriptor [20] around each STIP is of size 1×324. The descriptors obtained from all the training videos are clustered using k-means clustering, where the value k = 5000 is chosen empirically. The BoVF dictionary of size 324×5000 is formed by the centroid points of the k-means clustering, and a feature vector of size 1×5000 is generated from this BoVF as a single feature representation for the entire action video sequence. These feature vectors are used to train and test the classifiers.
Fig-7: Sequences of different types of actions in different scenarios: Custom Dataset
7. RESULTS
The above procedure is implemented in Microsoft Visual Studio using the OpenCV libraries. Table-1 shows the confusion matrices obtained for SVM classification on the KTH and custom datasets. The classifier is able to distinguish the respective actions properly. From the confusion matrix for the tested video sequences we obtain an average class accuracy of 82.5% on the KTH dataset. The same method is used to extract the feature vectors from the built custom dataset to train and test the SVM, and the resulting average class accuracy is 70% [16].
TABLE-1: CONFUSION MATRICES

SVM classifier with KTH dataset (rows: actual class, columns: predicted class)

               Boxing   Hand waving   Running   Walking
Boxing           19          1            0         0
Hand waving       0         20            0         0
Running           0          0           20         0
Walking           0          0           13         7

SVM classifier with Custom dataset (rows: actual class, columns: predicted class)

               Bouncing   Boxing   Jumping   Kicking
Bouncing           5          0         0        0
Boxing             2          3         0        0
Jumping            0          0         5        0
Kicking            3          0         1        1

Average class accuracy of the SVM classifier

               KTH Dataset   Custom Dataset
Accuracy          82.5%            70%
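As a quick consistency check, the reported accuracies follow directly from the confusion matrices above; the short NumPy snippet below reproduces the 82.5% figure for the KTH matrix (because every class has 20 test videos, the overall accuracy and the mean per-class accuracy coincide), and the same computation on the custom-dataset matrix gives 14/20 = 70%.

```python
import numpy as np

# Reported confusion matrix for the KTH test set (rows: actual, columns: predicted).
kth = np.array([[19, 1, 0, 0],
                [0, 20, 0, 0],
                [0, 0, 20, 0],
                [0, 0, 13, 7]])

overall = kth.trace() / kth.sum()              # 66 / 80 = 0.825
per_class = kth.diagonal() / kth.sum(axis=1)   # [0.95, 1.0, 1.0, 0.35]
print(overall, per_class.mean())               # both give 82.5% for balanced classes
```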
8. CONCLUSION
We have demonstrated that extracting spatio-temporal HOG features is well suited for action recognition and that these features are suitable for training and testing a supervised learning algorithm. The recognition task is implemented using the SVM algorithm. Recognition of the different actions using the SVM classifier gives an average class accuracy of 82.5% on the KTH dataset, which is better than the 72% result reported by B. Caputo et al. [16]. Further, the same procedure is followed for extracting spatio-temporal HOG features from the built custom dataset, and the recognition task using the SVM classifier gives an average class accuracy of 70%.
REFERENCES
[1] J. K. Aggarwal, et al., "Human activity analysis," ACM Computing Surveys, 2011.
[2] A. F. Bobick, J. W. Davis, "The recognition of human movement using temporal templates," IEEE Transactions on PAMI, pp. 257-267, 2001.
[3] Y. Ke, et al., "Spatio-temporal shape and flow correlation for action recognition," IEEE CVPR 2007, vol. 1, pp. 17-22, 2007.
[4] M. D. Rodriguez, et al., "Action MACH: a spatio-temporal maximum average correlation height filter for action recognition," IEEE CVPR 2008, vol. 1, no. 8, pp. 23-28, 2008.
[5] Y. Sheikh, et al., "Exploring the space of a human action," 10th IEEE ICCV 2005, vol. 144, no. 149, pp. 17-21, 2005.
[6] A. Yilmaz, M. Shah, "Actions sketch: a novel action representation," IEEE CVPR 2005, vol. 1, pp. 20-25, 2005.
[7] C. Rao, M. Shah, "View-invariance in action recognition," IEEE CVPR 2001, vol. 2, pp. 316-322, 2001.
[8] O. Chomat, et al., "Probabilistic recognition of activity using local appearance," IEEE CVPR 1999, vol. 2, no. 109, 1999.
[9] L. Zelnik-Manor, M. Irani, "Event-based analysis of video," IEEE CVPR 2001, vol. 2, no. 2, pp. 123-130, 2001.
[10] Shechtman, et al., "Actions as space-time shapes," 10th IEEE ICCV 2005, vol. 2, no. 1395, pp. 17-21, 2005.
[11] I. Laptev, T. Lindeberg, "Space-time interest points," 9th IEEE ICCV 2003, no. 439, pp. 13-16, 2003.
[12] P. Dollar, et al., "Behavior recognition via sparse spatio-temporal features," IEEE International Workshop on VS-PETS 2005, pp. 65-72, 2005.
[13] J. C. Niebles, et al., "Unsupervised learning of human action categories using spatial-temporal words," IJCV, vol. 79, no. 3, pp. 299-318, ISSN 0920-5691, 2008.
[14] A. Yilmaz, M. Shah, "Recognizing human actions in videos acquired by uncalibrated moving cameras," 10th IEEE ICCV 2005, vol. 1, pp. 150-157, 2005.
[15] M. S. Ryoo, et al., "Semantic understanding of continued and recursive human activities," Int. Conf. on Pattern Recognition.
Ms. Yashaswini Mahesh is pursuing her post-graduation degree in Digital Communication and
Networking stream at GSSS Institute of Engineering and Technology for Women, Mysore.
Dr. M. Shivakumar did his M.Tech in 1998 and Ph.D. in 2011 from the University of Mysore. He has a
teaching experience of more than 20 years. At present, he is working as Professor and Head in the
Department of Telecommunication Engineering of GSSS Institute of Engineering and Technology for
Women, Mysore. He has published several papers in international conferences and journals.
Dr. H. S. Mohana has a teaching experience of more than 27 years. At present, he is working as
Professor in the Department of Instrumentation Technology of Malnad College of Engineering, Hassan.
He has published several papers in international conferences and journals. His areas of interest include
Computer Vision Based Instrumentation, Digital Image Processing & Pattern Recognition.