CLASSIFICATION OF HUMAN ACTIONS USING NON-LINEAR SVM BY EXTRACTING SPATIO-TEMPORAL HOG FEATURES WITH CUSTOM DATASET

Yashaswini Mahesh1, Dr. M. Shivakumar2, Dr. H. S. Mohana3
1 Dept. of ECE, GSSSIT, Mysuru
2 Dept. of TCE, GSSSIT, Mysuru
3 Dept. of IT, Malnad College of Engineering, Hassan

ABSTRACT

Local space-time features capture local events in a video sequence; these features help to extract motion descriptors that capture the motion sequence in the video. This paper proposes to recognize human body actions by detecting Spatio-Temporal Interest Points (STIP). A motion descriptor called Histogram of Oriented Gradients (HOG) is extracted around each interest point in every frame of the video. From the HOG descriptors extracted from the training videos, a dictionary of Bag of Visual Features (BoVF) is created using k-means clustering. Each action video is then represented as a normalized histogram over the BoVF dictionary, and these histograms are used to train and test the classifier. Support Vector Machine (SVM) classifiers are used to classify the human body actions. The algorithms are evaluated on the challenging KTH human motion dataset as well as on a custom-built dataset defined for four actions, i.e. bouncing, boxing, jumping and kicking, in different scenarios. The experimental results show an average class accuracy of 82.5% on the KTH dataset and 70% on the real-world custom dataset.

Keywords: Spatio-temporal interest point (STIP), Histogram of Oriented Gradients (HOG), Bag of Visual Features (BoVF), Support Vector Machine (SVM).

1. INTRODUCTION

The recognition of human actions from videos has various important and interesting applications, and different approaches have been proposed for this task. The proposed work develops a Human Activity Recognition (HAR) system that automatically detects and recognizes ongoing human activities in an unknown video sequence. Such a HAR system enables the development of real-time applications.
Human activities have been classified according to their complexity into gestures, actions, interactions and group activities [1]. A HAR system has several real-world applications based on the information extracted from activity videos, such as detecting activities in public places for law enforcement purposes [1]. Other real-time applications are patient monitoring in hospitals, child care and monitoring of elderly persons. The challenging problems are distinctive scenarios such as a non-stationary camera, a moving background, size variations, different clothing, appearance, velocity and so on. A robust recognition system must overcome these problems in order to be a successful HAR system.

The presented work performs detection and recognition of human body actions. This is achieved by extracting Space-Time Interest Points (STIP). The number of feature points extracted varies from frame to frame; these features characterize the motion events locally in the video. HOG descriptors are extracted around each STIP, and a BoVF dictionary is created from the extracted descriptors. A normalized histogram (feature vector) is constructed with reference to the BoVF dictionary for each entire action video sequence. These feature vectors are used to train and test the SVM classifier. To evaluate the system, the KTH database is used; the results for recognizing four types of human actions (hand waving, boxing, walking and running) in different scenarios are tabulated in a confusion matrix. The accuracy is also examined on the custom-built dataset.

2. LITERATURE SURVEY

Human Activity Recognition (HAR) methodologies are classified into two categories: single-layered and hierarchical. The single-layered approach is best suited for low-level activities such as gestures and actions.
In the single-layered approach, features are extracted directly from the image sequence, which is suitable for actions with periodic repetition. The hierarchical approach is suited to represent high-level activities such as interactions and group activities, which are described in terms of simpler activities called sub-events. Since the proposed work addresses the recognition of human actions, the review is confined to the first approach. Fig-1 shows the tree-structure taxonomy of the single-layered approach, which is categorized into space-time approaches and sequential approaches.

Fig-1: Tree structure taxonomy for single-layered approach [1]. Single-layered approaches divide into space-time approaches (space-time volume, space-time trajectories, space-time features) and sequential approaches (exemplar-based, state-based).

Several successive methods have been proposed based on these approaches. In Davis and Bobick [3], patches are extracted from a video volume and compared with the video volumes in the stored database. In Ke et al. [4], three-dimensional segments related to the motion of human body parts are extracted automatically from the patches of a video sequence. Rodriguez et al. [5] generated filters that capture the characteristics of video volumes and compared them with the characteristics of the posed video. Sheikh et al. [6] and Yilmaz and Shah [7] showed a method to extract trajectories from the video volume, representing the action in a four-dimensional space. An affine projection is then used to represent the action as normalized three-dimensional trajectories, so that the extracted features are view-invariant. Rao and Shah [8] showed how to extract meaningful curvature patterns from the trajectories. Chomat et al. [9], Zelnik-Manor et al. [10] and Shechtman et al. [11] showed that the overall motion can be characterized by concatenating the frames temporally.
The approaches of Laptev [12], Dollar [13], Niebles et al. [14], Yilmaz and Shah [15] and Ryoo [16] extract sparse STIPs from a three-dimensional video signal. Schuldt, Caputo and Laptev [17] and Zhu, Xu et al. [18] recognized human actions from extracted features using SVMs, where local features are extracted to capture events locally in the video sequence. Valiollahzadeh et al. [19] proposed an algorithm that merges AdaBoost with Support Vector Machines (SVM) as weak component classifiers for the face detection task.

The proposed work uses the space-time local feature approach, which belongs to the space-time branch of the single-layered approaches, to extract features. The space-time local approach has several advantages over other approaches: background subtraction and detailed information are not required, and the extracted features are invariant to object size, direction and movement. This method is therefore best suited for recognizing repetitive actions such as 'jumping' and 'clapping', since these actions generate repetitive feature patterns.

3. DESIGN METHODOLOGY

Fig-2 and Fig-3 illustrate the outline of the human activity recognition system; its blocks are briefly described below. For robust recognition, the system should be invariant to scale, velocity, non-homogeneous background and noise. There are two sets of data, one for training and one for testing. The training dataset is needed to train the entire system with the relevant details of the actions present in the video sequences, and the testing dataset is used for performance evaluation of the system. The details of each model of the proposed design are as follows.
Fig-2: Feature Extraction Model (blocks: Video Sequence, Split Video into Frames, Convert Frames to Images, Pre-processing, Feature Extraction).

Fig-3: Classification Model (blocks: Input, Extracted Features, Features of Action database, Supervised Learning, Action Classifier).

3.1 Feature Extraction Model: Since the input to the proposed system is a video sequence, the video has to be analyzed frame by frame in order to obtain the details of object appearance, shape and motion characteristics. The video sequence is therefore split into frames, which are in turn converted into images. Each input image is pre-processed to remove unwanted signals while preserving the required details. In general, the input signal is affected by unwanted signals such as noise in the capturing device, lighting variation and changes in background clutter like trees, due to which the required details cannot be extracted satisfactorily. Pre-processing is done before feature extraction, using a simple Gaussian filter. The STIPs are extracted from the pre-processed images, and HOG descriptors are extracted for each STIP to create the BoVF.

3.2 Classification Model: Non-linear SVMs are used as classifiers. The features constructed from the BoVF are used to train and test the classifiers.

4. FEATURE EXTRACTION

The feature extraction methods used in the proposed methodology include extraction of spatio-temporal features, Histogram of Oriented Gradients (HOG) descriptors and creation of a Bag of Visual Features (BoVF).

4.1 Spatio-temporal features: To model a three-dimensional input video sequence, we use a function $F : \mathbb{R}^2 \times \mathbb{R} \to \mathbb{R}$ and construct its linear scale-space representation $L : \mathbb{R}^2 \times \mathbb{R} \times \mathbb{R}_+^2 \to \mathbb{R}$ by convolving $F$ with an anisotropic 3-D Gaussian kernel with spatial variance $\sigma_l^2$ and temporal variance $\tau_l^2$ [12].
$$L(\cdot;\sigma_l^2,\tau_l^2) = G(\cdot;\sigma_l^2,\tau_l^2) * F(\cdot) \tag{1}$$

where $\sigma_l^2$ and $\tau_l^2$ are the spatial and temporal variances. The anisotropic 3-D Gaussian kernel is defined as

$$G(x,y,t;\sigma_l^2,\tau_l^2) = \frac{1}{\sqrt{(2\pi)^3\sigma_l^4\tau_l^2}} \exp\left(-\frac{x^2+y^2}{2\sigma_l^2} - \frac{t^2}{2\tau_l^2}\right) \tag{2}$$

Harris interest points mark significant changes in the horizontal and vertical directions of an image. These points are found from a 3×3 second-moment matrix consisting of first-order spatial and temporal derivatives, convolved with another 3-D Gaussian sliding window function with variances $\sigma_i^2$ and $\tau_i^2$:

$$\mu = G(\cdot;\sigma_i^2,\tau_i^2) * \begin{pmatrix} L_x^2 & L_x L_y & L_x L_t \\ L_x L_y & L_y^2 & L_y L_t \\ L_x L_t & L_y L_t & L_t^2 \end{pmatrix} \tag{3}$$

The integration scales $\sigma_i^2$ and $\tau_i^2$ are related to the local scales $\sigma_l^2$ and $\tau_l^2$ as $\sigma_i^2 = s\sigma_l^2$ and $\tau_i^2 = s\tau_l^2$. The first-order derivatives are defined as

$$L_x(\cdot;\sigma_l^2,\tau_l^2) = \partial_x (G * F) \tag{4}$$
$$L_y(\cdot;\sigma_l^2,\tau_l^2) = \partial_y (G * F) \tag{5}$$
$$L_t(\cdot;\sigma_l^2,\tau_l^2) = \partial_t (G * F) \tag{6}$$

For a given video input $F$, interest points are detected at locations where $\mu$ has significant eigenvalues $\lambda_1, \lambda_2, \lambda_3$. To find such regions, the Harris corner detector is extended to three dimensions, i.e., into the temporal domain.
The extended Harris corner function combines the determinant and the trace of $\mu$:

$$H = \det(\mu) - k\,\operatorname{trace}^3(\mu) = \lambda_1\lambda_2\lambda_3 - k(\lambda_1+\lambda_2+\lambda_3)^3 \tag{7}$$

To show that the positive local maxima of $H$ correspond to points with high values of $\lambda_1, \lambda_2, \lambda_3$ (ordered as $\lambda_1 \le \lambda_2 \le \lambda_3$), the ratios $\alpha = \lambda_2/\lambda_1$ and $\beta = \lambda_3/\lambda_1$ are defined and $H$ is rewritten as

$$H = \lambda_1^3\left(\alpha\beta - k(1+\alpha+\beta)^3\right) \tag{8}$$

For $H \ge 0$ we can write $k \le \alpha\beta/(1+\alpha+\beta)^3$, and it follows that $k$ takes its maximum possible value $k = 1/27$ when $\alpha = \beta = 1$. Thus, three-dimensional interest points in a video $F$ can be identified by detecting positive local maxima of $H$ [12]. Figure 3 shows the detected STIPs for the walking and boxing action sequences.

Figure 3: Local space-time features detected for (i) walking and (ii) boxing.

4.2 Histogram of Oriented Gradient: Local object appearance and shape can be characterized well by the distribution of local intensity gradients or edge directions, without knowledge of the corresponding gradient or edge positions. The region-of-interest image is divided into small cells; using a detection window technique, each cell accumulates a local one-dimensional histogram of gradient directions, and all histogram entries are concatenated to form a single vector representation. For better invariance to illumination, shadowing and similar variations, the detection windows overlap by one cell size, and it is also useful to normalize the contrast of the local responses. HOG descriptors are extracted around each STIP for all the video sequences.

4.3 Bag of Visual Features: The BoVF is created from the HOG descriptors of the training videos using k-means clustering. The centroid of each cluster forms a keyword in the dictionary [21]. This BoVF dictionary is used to generate a feature vector from each action video to train and test the classifier.

5. CLASSIFICATION

Support Vector Machine: An SVM is a linear classifier that can easily be mapped to a multi-dimensional space by the kernel trick.
The linear SVM classifier relies on dot products between data point vectors. Let $K(x_i, x_j) = x_i^T x_j$; then the SVM classifier function is

$$f(x) = \operatorname{sign}\left(\sum_i \alpha_i y_i K(x_i, x) + b\right)$$

Every data point is mapped into a multi-dimensional space by a transformation $\Phi : x \mapsto \phi(x)$.

Figure 4: Mapping of (a) input space to (b) feature space.

The dot product then becomes $\phi(x_i)^T \phi(x_j)$, and the quantity computed is $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. A kernel function $K$ is a function that corresponds to a dot product in some expanded feature space. The two kernels used are the polynomial kernel and the Radial Basis Function (RBF). An RBF maps the data into an infinite-dimensional Hilbert space. The RBF is a Gaussian function, calculated as

$$K(x, z) = e^{-\|x - z\|^2 / (2\sigma^2)}$$

6. EXPERIMENTAL SETUP

The proposed method is evaluated on the KTH dataset and on a custom dataset.

KTH Dataset: In this work, four types of human actions are considered, namely boxing, hand waving, walking and running, performed under four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors. 160 videos have been selected, i.e., 40 videos from each action category, for training a classifier from the extracted feature vectors, and 80 videos, i.e., 20 videos from each category, are used exclusively for testing. All videos were taken over homogeneous backgrounds with a static camera at 25 fps.

Custom Dataset: Four types of human actions are considered, namely bouncing, boxing, jumping and kicking, performed under four different scenarios: indoors (s1), outdoors (s2), different clothes (s3) and different angle (s4). 140 videos were built, i.e., 35 videos from each action category, for training a classifier from the extracted feature vectors, and 20 videos, i.e., 5 videos from each category, are used exclusively for testing. All videos were taken over non-homogeneous backgrounds with a static camera at 30 fps.
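The two steps that turn a video into classifier input, namely quantizing its HOG descriptors against the BoVF dictionary (Section 4.3) and comparing the resulting histograms with the RBF kernel (Section 5), can be sketched as follows. This is a minimal illustration only, not the implementation used in the experiments: the function names are hypothetical, the dictionary is assumed to have been learned already by k-means, L1 normalization is an assumption (the text says only "normalized"), and the toy descriptors are 2-D with an arbitrarily chosen sigma.

```python
import math


def nearest_word(descriptor, dictionary):
    """Return the index of the dictionary centroid closest to the descriptor
    (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(dictionary)),
               key=lambda i: sq_dist(descriptor, dictionary[i]))


def bovf_histogram(descriptors, dictionary):
    """Build one video's feature vector: count how many descriptors fall on
    each visual word, then L1-normalize (assumed normalization)."""
    counts = [0] * len(dictionary)
    for d in descriptors:
        counts[nearest_word(d, dictionary)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(dictionary)
    return [c / total for c in counts]


def rbf_kernel(x, z, sigma=1.0):
    """RBF kernel K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))


# Toy data: a 2-word dictionary and descriptors from two hypothetical videos.
dictionary = [[0.0, 0.0], [1.0, 1.0]]
video_a = [[0.1, 0.0], [0.9, 1.0], [1.0, 1.2], [0.0, 0.2]]
video_b = [[0.0, 0.1], [0.1, 0.1], [0.2, 0.0], [1.1, 0.9]]

hist_a = bovf_histogram(video_a, dictionary)
hist_b = bovf_histogram(video_b, dictionary)
similarity = rbf_kernel(hist_a, hist_b, sigma=0.5)
```

Histograms of videos with similar visual-word statistics give kernel values close to 1, and the value decays towards 0 as the histograms diverge; an SVM would use these kernel values in place of plain dot products.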
The videos were down-sampled to a spatial resolution of 160×120 pixels. For the three-dimensional video volumes, histogram descriptors are computed in the neighbourhood of each STIP based on the motion and appearance characteristics. The space-time volume is of size 32×32×5, and for each 32×32 block the HOG descriptor is found by dividing the block into sub-blocks of 8×8. The generated HOG descriptor [20] around each STIP is of size 1×324. The descriptors obtained from all the training videos are clustered using k-means clustering, with k = 5000. The same method is used to extract the feature vectors from the custom-built dataset to train and test the SVM, and the resulting average class accuracy is 70% [16].

TABLE-1: CONFUSION MATRIX

SVM classifier with KTH Dataset

               Boxing   Hand waving   Running   Walking
Boxing           19          1           0         0
Hand waving       0         20           0         0
Running           0          0          20         0
Walking           0          0          13         7

SVM classifier with Custom Dataset

               Bouncing   Boxing   Jumping   Kicking
Bouncing          5          0        0         0
Boxing            2          3        0         0
Jumping           0          0        5         0
Kicking           3          0        1         1

SVM Classifiers Accuracy

KTH Dataset   Custom Dataset
   82.5%           70%

IJRISE||[193-198]
International Journal of Research In Science & Engineering Volume: 1 Special Issue: 2 e-ISSN: 2394-8299 p-ISSN: 2394-8280

8. CONCLUSION

We have demonstrated that spatio-temporal HOG features are well suited for action recognition and for training and testing a supervised learning algorithm. The recognition task was implemented using the SVM algorithm.

At present, he is working as Professor and Head in the Department of Telecommunication Engineering of GSSS Institute of Engineering and Technology for Women, Mysore. He has published several papers in international conferences and journals.

Dr. H. S. Mohana has a teaching experience of more than 27 years. At present, he is working as Professor in the Department of Instrumentation Technology of Malnad College of Engineering, Hassan.
He has published several papers in international conferences and journals. His areas of interest include Computer Vision Based Instrumentation, Digital Image Processing & Pattern Recognition.
