MONTE CARLO SAMPLING FOR VISUAL POSE TRACKING

Jehoon Lee¹, Romeil Sandhu¹, and Allen Tannenbaum¹,²

¹ Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA.
² Biomedical Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA.
jehoon.lee@gatech.edu, rsandhu@gatech.edu, tannenba@ece.gatech.edu

This work was supported in part by grants from NSF, AFOSR, ARO, as well as by a grant from NIH (NAC P41 RR-13218) through Brigham and Women's Hospital. This work is part of the National Alliance for Medical Image Computing (NAMIC), funded by the National Institutes of Health through the NIH Roadmap for Medical Research, Grant U54 EB005149. Information on the National Centers for Biomedical Computing can be obtained from http://nihroadmap.nih.gov/bioinformatics/.

ABSTRACT

In this paper, we present a visual pose tracking algorithm based on Monte Carlo sampling of the special Euclidean group SE(3) and knowledge of a 3D model. In general, the relative pose of an object in 3D space can be described by a sequence of transformation matrices, one per time step. Thus, the objective of this work is to find a transformation matrix in SE(3) such that the projection of the object transformed by this matrix coincides with the object of interest in the 2D image plane. To do this, first, a set of candidate transformation matrices is randomly generated via an autoregressive model. Next, a 3D model is transformed by each of these matrices. Finally, a region-based energy model is designed to evaluate the optimality of each transformed model's projection. Experimental results demonstrate the robustness of the proposed method in several tracking scenarios.

Index Terms— Visual tracking, 2D-3D pose estimation, Monte Carlo sampling

1. INTRODUCTION

Visual tracking is one of the major topics in the computer vision community. Numerous algorithms for locating and tracking objects of interest have been proposed in recent years; see [1] and the references therein. In this paper, we address the problem of 2D-3D pose estimation. This problem consists of determining an object's pose (position and orientation) in 3D, relative to a calibrated camera, from a 2D image sequence; the relative pose can be described by a trajectory of transformation matrices [2]. Thus, the objective of this paper is to find the 3D transformation matrix that optimally estimates the pose of an object of interest. To this end, we propose a Monte Carlo sampling method for estimating a 3D transformation matrix for 2D object tracking and 3D pose estimation. In addition, we take advantage of knowledge of a 3D model of the object. This allows the tracker to capture the various aspects (or quasi-deformations) of an object undergoing dynamic motion, which methods based on learning a collection of 2D shape priors have difficulty describing completely.

Several attempts to estimate the pose of a 3D object from a 2D image sequence have been presented. For example, in [3], the pose is estimated using local features such as visible edges. Such local-feature-based methods have difficulties dealing with cluttered backgrounds due to their sensitivity to noise. To make up for this weakness, the authors of [4, 5] present region-based models for 2D-3D pose estimation. In particular, [5] presents a variational approach that jointly performs 2D segmentation and 3D pose estimation; it is based on minimizing a region-based energy functional using a gradient descent flow.
The proposed algorithm is related to the work in [5]. However, instead of the gradient flow used in [5], we find the optimal pose of an object of interest by randomly generating a set of transformation matrices and then evaluating the proposed statistical energy model for each of them. To construct 3D transformation matrices, Monte Carlo sampling of the matrix group SE(3) is used, inspired by filtering schemes on Lie groups in the literature [6, 7]. This approach takes into account the underlying geometry of SE(3), rather than employing a local coordinate-based sampling, because the set of 3D rigid transformations is not a vector space but a curved Lie group [7]. The work in [6] presents a pose tracking method on SE(3) in a Bayesian framework and discusses minimum mean-square error estimation on SE(3) and SO(3). The authors of [7] sequentially draw samples from the two-dimensional affine group via particle filtering and propagate them using an autoregressive process developed for the state dynamics. Our work addresses a more general case than [7], since we focus not only on tracking 2D motion but also on estimating 3D pose.

2. PRELIMINARIES

2.1. Notation and Transformation Matrix

It is assumed that the camera is calibrated and the object motion is relatively smooth. Let $X = [X, Y, Z]^T$ be the coordinates of a point in $\mathbb{R}^3$ with respect to the reference frame of the camera. The calibrated camera is modeled as an ideal perspective projection $\pi : \mathbb{R}^3 \to \Omega$, $X \mapsto x$, where $\Omega \subset \mathbb{R}^2$ is the domain of an image $I(x)$ and $x = [x, y]^T = [X/Z, Y/Z]^T$ denotes coordinates in $\Omega$. $S$ is a smooth surface representing the shape of interest in $\mathbb{R}^3$, and its projected region is $R = \pi(S) \subset \Omega$; $R^c = \Omega \setminus R$ is the complementary region of $R$. Similarly, the curve $\hat{c} = \pi(C) \subset \Omega$ is the projection of the curve $C \subset S$ in 3D, where $\hat{c}$ denotes the boundary of $R$ in 2D: $\hat{c} = \partial R$.

Let $X_0 \in \mathbb{R}^3$ and $X \in \mathbb{R}^3$ be the coordinates of points on $S_0$ and $S$, respectively, where $S_0$ is the reference surface in 3D. Since the camera is calibrated, the pose of the object with respect to the camera is represented by $T \in SE(3)$, a $4 \times 4$ transformation matrix of a 3D rigid body motion, so that $S = T(S_0)$. Pointwise, this yields $X = T(X_0) = RX_0 + t$, where $R \in SO(3)$ is a $3 \times 3$ rotation matrix and $t \in \mathbb{R}^3$ is a $3 \times 1$ translation. In homogeneous coordinates, the transformation is represented as
$$T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}.$$

The Lie generators for 3D rigid motion span the tangent space of the Lie group manifold. Each generator corresponds to one geometric transformation, i.e., translations along the x, y, and z axes and rotations about the x, y, and z axes, respectively, given by (see [8] for the detailed derivation):
$$
B_1 = \begin{bmatrix} 0&0&0&1\\ 0&0&0&0\\ 0&0&0&0\\ 0&0&0&0 \end{bmatrix},\quad
B_2 = \begin{bmatrix} 0&0&0&0\\ 0&0&0&1\\ 0&0&0&0\\ 0&0&0&0 \end{bmatrix},\quad
B_3 = \begin{bmatrix} 0&0&0&0\\ 0&0&0&0\\ 0&0&0&1\\ 0&0&0&0 \end{bmatrix},
$$
$$
B_4 = \begin{bmatrix} 0&0&0&0\\ 0&0&-1&0\\ 0&1&0&0\\ 0&0&0&0 \end{bmatrix},\quad
B_5 = \begin{bmatrix} 0&0&1&0\\ 0&0&0&0\\ -1&0&0&0\\ 0&0&0&0 \end{bmatrix},\quad
B_6 = \begin{bmatrix} 0&-1&0&0\\ 1&0&0&0\\ 0&0&0&0\\ 0&0&0&0 \end{bmatrix}. \tag{1}
$$

Several papers on distance metrics on SE(3) can be found in the literature [6, 7, 9]. For example, one way of evaluating the mean of special orthogonal matrices in SO(3) is presented in [9]. The rotation mean $\bar{R} \in SO(3)$ of $N$ rotations is explicitly evaluated as
$$
\bar{R} = \begin{cases} V U^T & \text{for } \det(G) \geq 0, \\ V D U^T & \text{otherwise,} \end{cases} \tag{2}
$$
where $D = \mathrm{diag}(1, 1, -1)$, $G = \frac{1}{N}\sum_{i=1}^{N} R_i$, and the transpose $G^T$ is decomposed via the singular value decomposition as $G^T = U \Sigma V^T$.
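To make the generator basis and the averaging formula concrete, here is a minimal NumPy sketch (our illustration, not part of the original paper); the function names se3_generators and rotation_mean are our own.

```python
import numpy as np

# Six se(3) generators of equation (1): translations along x, y, z,
# then rotations about x, y, z, as 4x4 matrices on homogeneous coordinates.
def se3_generators():
    B = [np.zeros((4, 4)) for _ in range(6)]
    B[0][0, 3] = B[1][1, 3] = B[2][2, 3] = 1.0   # translations along x, y, z
    B[3][2, 1], B[3][1, 2] = 1.0, -1.0           # rotation about x
    B[4][0, 2], B[4][2, 0] = 1.0, -1.0           # rotation about y
    B[5][1, 0], B[5][0, 1] = 1.0, -1.0           # rotation about z
    return B

# Rotation mean of equation (2): average N rotation matrices, then project
# back onto SO(3) via the SVD of G^T, flipping one direction if det(G) < 0.
def rotation_mean(Rs):
    G = sum(Rs) / len(Rs)
    U, _, Vt = np.linalg.svd(G.T)   # G^T = U Sigma V^T
    V = Vt.T
    if np.linalg.det(G) >= 0:
        return V @ U.T
    return V @ np.diag([1.0, 1.0, -1.0]) @ U.T
```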
For the translational components, the arithmetic mean $\bar{t} \in \mathbb{R}^3$ is used. Thus, the sample mean on SE(3) is
$$\bar{T} = \begin{bmatrix} \bar{R} & \bar{t} \\ 0 & 1 \end{bmatrix}.$$

3. PROPOSED METHOD

3.1. Motion Model

In general, the information we have about the dynamics of a system is incomplete. One way to overcome the difficulties associated with this lack of information is to use a stochastic model for the system dynamics. One could choose a random walk model for its simplicity; however, since the random walk model requires a sufficiently large number of samples to be practical, a Newtonian motion model [6] or an infinitesimal constant velocity model [7] can be chosen as a more efficient way to reach the optimal solution. In this work, extending the work in [7], we construct a first-order autoregressive (AR) model on SE(3), based on motion variation, to determine how a transformation matrix is generated:
$$T_k = T_{k-1} \exp\!\left( a \log\!\left( T_{k-2}^{-1} T_{k-1} \right) + du_k \sqrt{\Delta k} \right), \tag{3}$$
where $\exp(\cdot)$ and $\log(\cdot)$ denote the matrix exponential and the matrix logarithm, respectively [2], $a$ is the AR process parameter, and $\Delta k$ is the discrete time interval. The process noise $du_k$ is determined by the degree of variation of the object motion and the Lie generators $B_l$ ($l = 1, \ldots, 6$) as follows:
$$du_k = \sum_{l=1}^{6} \rho_{(l)}\, \tau_{(l)}\, B_l, \qquad \tau \sim \mathcal{N}(0, \sigma_k), \tag{4}$$
where $\mathcal{N}(\cdot)$ denotes the normal distribution, and $\rho_{(l)}$ and $\tau_{(l)}$ denote the $l$th elements of $\rho \in \mathbb{R}^6$ and $\tau \in \mathbb{R}^6$, respectively. $\rho$ is a user-defined diffusion weight vector; it controls the range of transformations over which samples are generated. In our experiments, $\rho = [0.5, 0.5, 0.5, 0.1, 0.1, 0.1]^T$ is used, but this can be relaxed to search a larger space at additional computational cost. To compute the covariance matrix $\sigma_k$, we represent $T$ as a six-dimensional vector by using twist coordinates for the rotation $R$ and appending the translation $t$: $\lambda = [t_x, t_y, t_z, \theta_x, \theta_y, \theta_z]$. The covariance matrix $\sigma_k \in \mathbb{R}^{6 \times 6}$ is then defined as
$$\sigma_k = E\!\left[ \left( \hat{\lambda}^i_{k-1} - \lambda_{k-1} \right) \left( \hat{\lambda}^i_{k-2} - \lambda_{k-2} \right)^T \right], \tag{5}$$
where $\hat{\lambda}^i$ ($i = 1, \ldots, N$) are the parameters of the predicted transformation matrices in the Monte Carlo framework, described in Section 4.

The AR process in equation (3) distributes the samples in the sample space according to the motion variance of the tracked object during tracking. For example, the elephant in the sequence shown in Figure 4(d) remains in the same position throughout the sequence, whereas the elephant in the sequence shown in Figure 4(a) dynamically changes its pose due to a moving camera. The graph in Figure 1 shows that the degree of motion variation in the sequence of Figure 4(d) is much smaller than in the sequence of Figure 4(a). Naturally, the computed variance for a moving object is much higher than for a stationary one.

Fig. 1. Graph of motion variation. The solid line and the dotted line denote the covariance variation of the sequences given in Figure 4(a) and Figure 4(d), respectively.

3.2. Energy Model

The proposed energy model for evaluating the optimality of a candidate transformation is simply defined as
$$E_R = \int_{R} r_i(I(x), \hat{c})\, d\Omega + \int_{R^c} r_o(I(x), \hat{c})\, d\Omega, \tag{6}$$
where $r_i : \chi, \Omega \to \mathbb{R}$ and $r_o : \chi, \Omega \to \mathbb{R}$ are mapping functions that measure the quality of fit inside ($R$) and outside ($R^c$) the curve, respectively. Here, $\chi$ is the space corresponding to the photometric variable of interest. Intuitively, we want to drive the pose estimation so that the projected curve's interior and exterior distributions are maximally different.
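As an illustration of the sampling step, the following sketch draws one candidate pose from equations (3)-(4) using SciPy's matrix exponential and logarithm. The function name and the default parameter values (e.g., $a = 0.5$) are our assumptions for the sketch, not values fixed by the paper.

```python
import numpy as np
from scipy.linalg import expm, logm

def sample_pose(T_km1, T_km2, B, rho, sigma_k, a=0.5, dt=1.0, rng=None):
    """Draw one candidate pose T_k^i from the AR model of equation (3).

    T_km1, T_km2 : estimated 4x4 poses at steps k-1 and k-2
    B            : the six se(3) generators of equation (1)
    rho          : diffusion weights, e.g. [0.5, 0.5, 0.5, 0.1, 0.1, 0.1]
    sigma_k      : 6x6 covariance of equation (5); a (assumed 0.5 here)
                   is the AR parameter and dt the time interval.
    """
    rng = np.random.default_rng() if rng is None else rng
    tau = rng.multivariate_normal(np.zeros(6), sigma_k)   # tau ~ N(0, sigma_k)
    du = sum(rho[l] * tau[l] * B[l] for l in range(6))    # noise du_k, eq. (4)
    # AR drift term; the real part guards against tiny numerical residue.
    drift = a * np.real(logm(np.linalg.inv(T_km2) @ T_km1))
    return T_km1 @ expm(drift + du * np.sqrt(dt))         # equation (3)
```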
Any functions $r_i$ and $r_o$ that statistically describe the region properties of $R$ and $R^c$ can be used; for example, one can choose the mean intensity model or distinct Gaussian models. In our work, $r_i$ and $r_o$ are defined as
$$r_i = \log(\Sigma_i) + \frac{(I(x) - \mu_i)^2}{\Sigma_i}, \qquad r_o = \log(\Sigma_o) + \frac{(I(x) - \mu_o)^2}{\Sigma_o},$$
where $\Sigma_i$ and $\Sigma_o$ are the covariances and $\mu_i$ and $\mu_o$ the intensity means for $R$ and $R^c$, respectively ($\Sigma_{i/o} \in \mathbb{R}^{3 \times 3}$ and $\mu_{i/o} \in \mathbb{R}^3$ for a color image).

Fig. 2. 3D point models used in this paper. From left to right: the letters "ABC", a red car, and a grey elephant.

Fig. 3. Robustness to several noise levels: (a) noise level $\sigma_n^2 = 10\%$, (b) noise level $\sigma_n^2 = 50\%$, (c) noise level $\sigma_n^2 = 100\%$ (frame order: from left to right). In (c), the letters "ABC" and the background are barely visually distinguishable.

Table 1. Percent (%)-absolute errors for the letters "ABC" sequence shown in Figure 3. T.avg, T.std, and T.med (R.avg, R.std, and R.med) denote the average, standard deviation, and median of the translation (rotation) error.

| $\sigma_n^2$ | T.avg | T.std | T.med | R.avg | R.std | R.med |
| 10%  | 0.64 | 0.40 | 0.56 | 6.73 | 5.00 | 4.72 |
| 50%  | 0.83 | 0.55 | 0.83 | 6.57 | 5.43 | 3.96 |
| 100% | 1.57 | 1.01 | 1.49 | 8.89 | 8.48 | 5.05 |

Now, we assume that the statistical characteristics of the segmented object do not change between consecutive frames. This assumption allows the tracker to find the curve whose statistical properties are most similar to those of the previously selected curve. Thus, we adopt the Bhattacharyya distance to evaluate the degree of similarity between the intensity distributions of the tracked object in consecutive frames. The Bhattacharyya distance between two probability density functions $p_1$ and $p_2$ is defined as [10]
$$D_B = -\log \int_{\mathbb{R}^N} \sqrt{p_1(x)\, p_2(x)}\, dx. \tag{7}$$

We represent $p_1$ and $p_2$ by normalized histograms $h_1$ and $h_2$, respectively. Histograms are relatively invariant to object motion, i.e., translation, rotation, and scale. Thus, metrics based on histograms can be good indicators of the similarity between two distributions, for example, the intensity distributions of an object in consecutive video frames. For color imagery, a histogram can be computed by concatenating or by averaging the histograms of each pre-defined color channel. For example, one can obtain a histogram by using only pixels with large saturation and value in the HSV color space [11]. In our work, we use the pixels inside $\hat{c}$ for the histogram computation, averaging the histograms of the R, G, and B channels. Now, given two histograms $h_1(x)$ and $h_2(x)$ with $x \in \mathbb{R}$, we obtain the following similarity measure from equation (7):
$$E_B = 1 - \int \sqrt{h_1(x)\, h_2(x)}\, dx. \tag{8}$$

$E_B$ lies within $[0, 1]$, and as the degree of similarity between the histograms increases, $E_B$ goes to zero. The overall energy model is defined by adding the two energy terms from equations (6) and (8):
$$E_{image} = E_R + E_B. \tag{9}$$
Therefore, the optimal projection (or projected curve) minimizes $E_{image}$.
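Here is a minimal sketch of how the two energy terms might be evaluated, assuming a grayscale frame and a binary region mask; the function names and the small variance guard are our own additions.

```python
import numpy as np

def region_energy(img, mask):
    """E_R of equation (6) with the Gaussian model above, grayscale case:
    r_i is summed over R (mask == True) and r_o over its complement R^c."""
    inside, outside = img[mask].astype(float), img[~mask].astype(float)
    mu_i, var_i = inside.mean(), inside.var() + 1e-12   # guard zero variance
    mu_o, var_o = outside.mean(), outside.var() + 1e-12
    r_i = np.log(var_i) + (inside - mu_i) ** 2 / var_i
    r_o = np.log(var_o) + (outside - mu_o) ** 2 / var_o
    return r_i.sum() + r_o.sum()

def bhattacharyya_energy(h1, h2):
    """E_B of equation (8): one minus the Bhattacharyya coefficient of two
    normalized histograms; identical distributions give 0."""
    return 1.0 - np.sqrt(h1 * h2).sum()
```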
4. POSE TRACKING ALGORITHM

We summarize the overall tracking approach by embedding the previously proposed models into a Monte Carlo framework; a code sketch of the resulting loop is given after the list.

(a) Initialize $T_{k=0,1}$ and $h_{k=0,1}$.

(b) Generate $N$ sample transformation matrices $\{T_k^i\}_{i=1}^{N}$ from equation (3): $T_k^i = T_{k-1} \exp\!\left( a \log\!\left( T_{k-2}^{-1} T_{k-1} \right) + du_k \sqrt{\Delta k} \right)$. Note that the transformation matrix at the current step $k$ depends on the transformations up to two steps back, $k-1$ and $k-2$. Also, $T_k^i$ refers to the estimated transformations of the previous time steps, $T_{k-1}$ and $T_{k-2}$, not to each $i$th sample; this differs from the particle-filtering-based work of [7].

(c) Perform the 3D transformation on $X_0$: $X^i = T_k^i(X_0)$.

(d) Project the transformed 3D model onto the 2D image plane, $R^i = \pi(X^i)$, and compute the histogram $h^i$ of $R^i$.

(e) Evaluate the energy in equation (9) for each projected model: $E_{image}^i = E_R(R^i, I_k) + E_B(h_{k-1}, h^i)$.

(f) Estimate the optimal transformation as $T_k^* = \arg\min_{T} E_{image}^i$, or by computing the sample mean on SE(3), $\bar{T} = \begin{bmatrix} \bar{R} & \bar{t} \\ 0 & 1 \end{bmatrix}$, where $\bar{t} = \frac{1}{N} \sum_{i=1}^{N} w^i t^i$ and $G = \frac{1}{N} \sum_{i=1}^{N} w^i R^i$ in equation (2), with $w^i$ the normalized value of $e^{-E_{image}^i}$.

(g) Iterate steps (b) to (f) until $E_{image}$ converges or the energy difference between frames is small enough.

(h) Let the histogram of the selected model be $h_k$, set $k = k + 1$, and go to step (b).
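Putting the pieces together, the following is a hedged sketch of one tracking iteration reusing the earlier sketches; project() and histogram() are hypothetical placeholders for the perspective projection of step (d) and the region-histogram computation, which depend on the rendering setup.

```python
import numpy as np

def track_frame(img, T_km1, T_km2, h_prev, X0_h, B, rho, sigma_k, N=100):
    """One Monte Carlo update, steps (b)-(f) of the algorithm above.
    X0_h is the 3D model as a 4 x M array of homogeneous points;
    project() and histogram() are hypothetical helpers (not defined here)
    returning the boolean region mask R^i and its normalized histogram."""
    best = (np.inf, None, None)
    for _ in range(N):
        T_i = sample_pose(T_km1, T_km2, B, rho, sigma_k)     # step (b)
        X_i = T_i @ X0_h                                     # step (c)
        mask = project(X_i, img.shape)                       # step (d)
        h_i = histogram(img, mask)
        E = region_energy(img, mask) \
            + bhattacharyya_energy(h_prev, h_i)              # step (e), eq. (9)
        if E < best[0]:                                      # step (f), argmin variant
            best = (E, T_i, h_i)
    return best[1], best[2]   # estimated pose T_k^* and its histogram h_k
```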
5. EXPERIMENTAL RESULTS

The proposed tracking algorithm is tested on several illustrative sequences; color images are used, except for the synthetic sequence "ABC", which is grayscale. The 3D models used in our tests are shown in Figure 2. The number of generated transformation matrices is held fixed at $N = 100$ across all experiments. As the number of transformation matrices increases, more accurate results are expected, but the computational complexity increases as well.

First, to demonstrate the robustness of the proposed method to noise, synthetic sequences were created with different levels of Gaussian noise, whose variance $\sigma_n^2$ ranges from 10% to 100%, as shown in Figure 3. For a quantitative evaluation of the tracking results, percent (%)-absolute errors are computed for both translation and rotation at each noise level; we use
$$\%\text{-}e = \frac{\| v_m - v_t \|}{\| v_t \|} \times 100,$$
where $v_m$ and $v_t$ are the measured and ground-truth translation and rotation vectors, respectively. As the noise level increases, the errors also increase, as shown in Table 1. However, tracking is maintained throughout all sequences, and all errors remain below 2% for translation and 9% for rotation.

Figure 4 shows tracking results on several sequences in which drastic changes of the object's pose, caused by a moving camera, were captured. As one can see in Figure 4, the pose of the objects was successfully recovered despite the dynamic pose changes and the cluttered environment. Note the severe scale changes of the red car in Figure 4(c). In Figure 4(d), the track of the elephant is maintained despite occlusion by the marker throughout the sequence.

Fig. 4. Tracking (a), (b), and (d) a grey elephant, and (c) a red car, in noisy and cluttered environments. Shown frames: (a) 3, 59, 190; (b) 3, 216, 432; (c) 50, 90, 135, 275, 376, 560; (d) 91, 109, 130.

6. CONCLUSION

In this note, we proposed a method for 2D-3D visual pose tracking via Monte Carlo sampling on the special Euclidean group SE(3). As shown in the experiments, the proposed method yields promising results for optimally tracking an object of interest in 2D and estimating its pose in 3D in cluttered environments. Moreover, the use of knowledge of a 3D model allows the tracker to capture the quasi-deformations of an object more accurately than 2D-based trackers. However, the proposed method has some limitations, which we intend to overcome in future work. In particular, our framework suffers from high computational complexity due to the nature of sampling-based methods. One possible remedy is to exploit the system dynamics more effectively; i.e., the number of samples can be adapted at each time step according to the variation of the object's motion. For example, the pose of a stationary object can be obtained from a small sample size.

7. REFERENCES

[1] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," ACM Computing Surveys (CSUR), vol. 38, no. 4, 2006.

[2] Y. Ma, S. Soatto, J. Kosecka, and S. Sastry, An Invitation to 3-D Vision, Springer-Verlag, 2003.

[3] T. Drummond and R. Cipolla, "Real-time visual tracking of complex structures," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 932–946, 2002.

[4] C. Schmaltz, B. Rosenhahn, T. Brox, D. Cremers, J. Weickert, L. Wietzke, and G. Sommer, "Region-based pose tracking," Pattern Recognition and Image Analysis, pp. 56–63, 2007.

[5] S. Dambreville, R. Sandhu, A. Yezzi, and A. Tannenbaum, "A geometric approach to joint 2D region-based segmentation and 3D pose estimation using a 3D shape prior," SIAM Journal on Imaging Sciences, 2009.

[6] A. Srivastava, "Bayesian filtering for tracking pose and location of rigid targets," Proceedings of SPIE Signal Processing, Sensor Fusion, and Target Recognition, vol. 4052, pp. 160–171, 2000.

[7] J. Kwon and F. C. Park, "Visual tracking via particle filtering on the affine group," The International Journal of Robotics Research, vol. 29, no. 2–3, p. 198, 2010.

[8] E. Bayro-Corrochano and J. Ortegón-Aguilar, "Lie algebra approach for tracking and 3D motion estimation using monocular vision," Image and Vision Computing, vol. 25, no. 6, pp. 907–921, 2007.

[9] M. Moakher, "Means and averaging in the group of rotations," SIAM Journal on Matrix Analysis and Applications, vol. 24, no. 1, pp. 1–16, 2003.

[10] T. Kailath, "The divergence and Bhattacharyya distance measures in signal selection," IEEE Transactions on Communication Technology, vol. 15, no. 1, pp. 52–60, 1967.

[11] P. Perez, C. Hue, J. Vermaak, and M. Gangnet, "Color-based probabilistic tracking," in Proc. European Conference on Computer Vision, 2002, vol. I, pp. 661–675.