Rotating Your Face Using Multi-Task Deep Neural Network (CVPR 2015)
Junmo Kim
Joint work with Junho Yim, Heechul Jung, ByungIn Yoo, Changkyu Choi, Dusik Park
School of Electrical Engineering, KAIST
Samsung Advanced Institute of Technology, SAIT
2015. 04. 24

Introduction: Pose- and Illumination-Invariant Face Recognition
[Figure: example face images varying in pose and in illumination]

DeepFace (CVPR 2014)
• Face recognition pipeline: detect → align → represent → classify
• Alignment: employed explicit 3D face modeling
• Representation: a 9-layer deep neural network
  • More than 120 million parameters
  • Locally connected layers without weight sharing
  • Trained with 4 million labeled face images
• 97% recognition rate on LFW
Taigman et al. DeepFace: Closing the Gap to Human-Level Performance in Face Verification, CVPR'14

DeepID2 (NIPS 2014)
• 99.15% face verification accuracy on LFW
• Trained with the CelebFaces+ dataset
  • 202,599 face images of 10,177 identities (celebrities)
Sun et al. Deep Learning Face Representation by Joint Identification-Verification, NIPS 2014

Identity-Preserving Face Transform (ICCV 2013)
• A deep network extracts face identity-preserving (FIP) features.
• A frontal face image is synthesized from the FIP features.
Zhu et al. Deep Learning Identity-Preserving Face Space, ICCV 2013

Proposed Model (Naïve Version)
• Pose- and illumination-invariant feature
• Input: an image with arbitrary pose under arbitrary illumination
• Output: generated images at the desired pose under frontal illumination
[Figure: input images and output images, with pose labels −60°, −45°, 0°, 15°, 30°]

Proposed Model (Naïve Version)
• Remote Code
  • Represents the desired pose information
  • Ex: 0 0 1 0 0 means "rotate to 0°"
• Training
  • Using the Multi-PIE dataset
  • Input:
    1. Arbitrary-pose image under arbitrary illumination
    2. Remote Code
  • Output: desired-pose image under frontal illumination
[Figure: DNN with all fully connected layers (v, hidden1, hidden2, hidden3, output). The input image and the Remote Code 0 0 1 0 0, expanded with a simple repetition code, enter the network; the output image is compared with the label.]

Proposed Model (Naïve Version)
• Remote Code
• Result
[Figure: input images, Remote Codes (01000, 00100, 00100, 01000), and the corresponding output images]

Generating Multi-View Representation (NIPS 2014)
• Z. Zhu et al., "Deep Learning Multi-View Representation for Face Recognition", NIPS 2014
• Can generate multi-view images from a single-view face image
• Several candidate face images are generated and the best fit is selected
[Figure: the inputs (first column) and the multi-view outputs (remaining columns) of two identities. Legend: input image, output image, view label of the output image, random binary hidden neurons.]

Proposed Model (Naïve Version)
• Remote Code
• Result
[Figure: an input image and the outputs generated with Remote Codes 10000, 01000, 00100, 00010, 00001]
• Problem
  • Identity is not well preserved → introduce one more task
  • May overfit because all layers are fully connected → use locally connected layers

Proposed Model (Refined Version)

Multi-task Learning
• General multi-task learning
  • Shares some layers to determine common features
  • The remaining layers are split into multiple tasks
S. Li, Z.-Q. Liu, and A. B. Chan. Heterogeneous multi-task learning for human pose estimation with deep convolutional neural network. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on, pages 488-495. IEEE, 2014.
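The Remote Code input of the naïve model described above can be illustrated with a minimal sketch. It assumes a hypothetical five-pose list, a hypothetical repetition count, and illustrative function names; the slides only state that 0 0 1 0 0 means "rotate to 0°" and that a simple repetition code is used.

```python
from typing import List, Sequence

# Hypothetical pose list (degrees): the slides only say that
# "0 0 1 0 0" means "rotate to 0 degrees".
POSES: Sequence[int] = (-30, -15, 0, 15, 30)


def remote_code(target_pose: int, poses: Sequence[int] = POSES) -> List[int]:
    """One-hot code with a 1 at the index of the desired pose."""
    if target_pose not in poses:
        raise ValueError(f"unsupported pose: {target_pose}")
    return [1 if p == target_pose else 0 for p in poses]


def repetition_code(code: Sequence[int], repeats: int) -> List[int]:
    """Repeat the whole code `repeats` times (a simple repetition code)."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    return list(code) * repeats


def build_input(pixels: Sequence[float], target_pose: int,
                repeats: int = 3) -> List[float]:
    """Concatenate flattened image pixels with the repeated Remote Code.

    `repeats` is an assumed value, not one taken from the slides.
    """
    code = repetition_code(remote_code(target_pose), repeats)
    return list(pixels) + [float(bit) for bit in code]
```

Repeating the code is one plausible way to keep a five-element signal from being swamped by thousands of pixel values in a fully connected input layer; the actual repetition length used by the authors is not given in the slides.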
Multi-task Learning
• Proposed model
[Figure: proposed network (v, hidden1, hidden2, hidden3, hidden4, output). The input image and the Remote Code 0 0 1 0 0 enter the network; one branch produces the output image and the other reproduces the input image.]

Multi-task Learning
• Proposed idea
  • Preserve identity by using the second task

Proposed Model (Refined Version)
• Proposed model
  • Part 1: Feature extraction part
  • Part 2: Feature rotation part
  • Part 3: Imaging part
  • Part 4: Reconstruction part

Proposed Model (Refined Version)
• Interesting points
  • Locally connected layer
    • Uses fewer weights than a fully connected layer
    • More suitable than a convolutional layer
  • Early fully connected layer
    • Changes the features so that they contain the target pose information

Proposed Model (Refined Version)
• Remote Code
  • Remote code: [definition not recoverable from the extracted text]
  • Experimentally, how the code is embedded in the image is not important

Proposed Model (Refined Version)
• Recon Code
  • The Recon code contains a pose part and an illumination part [definitions not recoverable from the extracted text]

Proposed Model (Refined Version)
• Multi-task learning
  • Cost function
    • Squared L2 norm
  • Main task
    • Generate the desired-pose image
  • Auxiliary task
    • Reconstruct the input image
  • Total cost function [equation not recoverable from the extracted text]

Experimental Result: Experimental setting
• Multi-PIE dataset
  • 337 subjects with 15 different poses under 20 illumination changes

Experimental Result: Experimental setting
• Setting 1
  • 7 different poses (−45° ~ 45°) under 20 different illuminations
  • Training: 100 subjects (14,000 images)
  • Testing: 149 subjects (16,986 images, excluding the 0° pose and the frontal illumination)
  • Gallery image: each subject's frontal image under frontal illumination
• Setting 2
  • 9 different poses (−60° ~ 60°) under 20 different illuminations
  • Training: 200 subjects (36,000 images)
  • Testing: 137 subjects (24,660 images)
  • Gallery image: each subject's frontal image under frontal illumination

Experimental Result: Experimental setting
• Face recognition
  • Find the nearest neighbor (NN) using the L2 norm
[Figure: synthesized images of a test face are compared with the gallery images n001 ... n149 (for example, person 1, frontal pose, illumination 7; person 2, frontal pose, illumination 7). Result: n001.]

Experimental Result: Feature space
[Figure: feature space for test images at −60°, −45°, −30°, −15° and target poses 0°, 15°, 30°, 45°, 60°]

Experimental Result: Feature space
[Figure: feature space of CPF and CPI]

Experimental Result: Comparison with the state of the art
• Face recognition: Setting 1
  • 7 poses, 20 illuminations
  • 100 training subjects, 149 test subjects (16,986 images)
  • Recognition rates (%) for the various illuminations

Illum.     0     1     2     3     4     5     6     8     9     10
Z.Zhu[1]   72.8  75.8  75.8  75.7  75.7  75.7  75.7  75.7  75.7  75.7
CPI        66.0  62.6  69.6  73.0  79.1  84.5  86.6  86.5  84.2  80.2
CPF        59.7  70.6  76.3  79.1  85.1  89.4  91.3  92.3  90.6  86.5

Illum.     11    12    13    14    15    16    17    18    19    Avg.
Z.Zhu[1]   75.7  75.7  75.7  73.4  73.4  73.4  73.4  72.9  72.9  74.7
CPI        76.0  70.8  65.7  76.1  78.2  80.7  79.4  77.3  65.4  75.9
CPF        81.2  77.5  72.8  82.3  84.2  86.5  85.9  82.9  59.2  80.7

[1] Z. Zhu, P. Luo, X. Wang, and X. Tang. Deep learning identity preserving face space. In ICCV, 2013.

Experimental Result: Comparison with the state of the art
• Face recognition: Setting 1
  • 7 poses, 20 illuminations
  • 100 training subjects, 149 test subjects (16,986 images)
  • Recognition rates (%) for the various poses (the 0° pose is excluded from testing)

Pose       -45   -30   -15   0     15    30    45    Avg.
Z.Zhu[1]   67.1  74.6  86.1  -     83.3  75.3  61.8  74.7
CPI        66.6  78.0  87.3  -     85.5  75.8  62.3  75.9
CPF        73.0  81.7  89.4  -     89.5  80.4  70.3  80.7

Experimental Result: Comparison with the state of the art
• Face recognition: Setting 2
  • 9 poses, 20 illuminations
  • 200 training subjects, 137 test subjects (24,660 images)
  • Recognition rates (%) for the various poses

Pose       -60   -45   -30   -15   0     15    30    45    60    Avg.
Z.Zhu[1]   44.6  63.6  77.5  90.5  94.3  89.8  80.0  56.5  38.9  70.8
Z.Zhu[2]   60.2  75.2  83.4  93.3  95.7  92.2  83.9  70.6  60.0  79.3
CPI        55.8  71.8  80.0  90.1  98.4  90.2  82.7  71.0  52.9  77.0
CPF        63.2  80.4  88.1  94.5  99.5  95.4  88.9  79.4  60.6  83.3

[2] Z. Zhu, P. Luo, X. Wang, and X. Tang. Deep learning multi-view representation for face recognition. In NIPS, 2014.

Experimental Result: Comparison with the single task
• Superiority of multi-task learning
[Figure: images generated with multi-task learning and with single-task learning]

Experimental Result: Comparison with the single task
• Superiority of multi-task learning
• Face recognition: Setting 1 (recognition rates (%) for the various illuminations)

Illum.     0     1     2     3     4     5     6     8     9     10
Single     45.4  64.3  72.9  74.9  82.0  86.9  89.8  89.7  87.9  81.7
Multi      59.7  70.6  76.3  79.1  85.1  89.4  91.3  92.3  90.6  86.5

Illum.     11    12    13    14    15    16    17    18    19    Avg.
Single     76.5  72.2  66.7  76.9  80.9  82.7  79.9  76.5  47.1  75.5
Multi      81.2  77.5  72.8  82.3  84.2  86.5  85.9  82.9  59.2  80.7

Experimental Result: Comparison with the late fully connected layer
• Superiority of the early fully connected layer
• Face recognition: Setting 2
  • Early fully connected layer: 83.3%
  • Late fully connected layer: 77.0%

Experimental Result: Comparison with general multi-task learning
• Superiority of the proposed multi-task learning
[Figure: network structures of Proposed, General 1, General 2, and General 3]

Experimental Result: Comparison with general multi-task learning
• Superiority of the proposed multi-task learning
• Face recognition: Setting 2 (recognition rates (%) for the various poses)

Pose       -60   -45   -30   -15   0     15    30    45    60    Avg.
Proposed   63.2  80.4  88.1  94.5  99.5  95.4  88.9  79.4  60.6  83.3
General1   26.9  61.7  73.2  87.4  99.1  87.5  72.4  59.2  28.1  66.3
General2   34.7  64.5  74.1  80.8  96.9  85.3  75.2  67.5  32.9  68.0
General3   56.4  75.9  85.0  92.5  98.4  92.7  83.7  76.0  56.6  79.7

Experimental Result
• Generated images
[Figure: generated images]

Conclusion
• Proposed a novel type of network that synthesizes the desired-pose image according to a user-specified Remote Code
• Proposed a novel type of multi-task network that preserves identity better
• Outperforms the previous state-of-the-art models by 4 to 6 percentage points in average recognition rate

Rotating My Face with Multi-Task DNN
[Figure: original image, aligned face, and generated faces at −45°, −30°, −15°, 0°, 15°, 30°, 45°]

Appendix 1
• Z. Zhu et al., "Deep Learning Multi-View Representation for Face Recognition", NIPS 2014
• Notation (symbols lost in extraction): input image; output image; view label of the output image; random binary hidden neurons
  • The hidden neurons are sampled from a distribution
• Training
  • Learned by maximizing the data log-likelihood

Appendix 2
• The percentage of CPFs contributing to the final results
[Figure: plot not recoverable from the extracted text]

Appendix 3
• The feature space of 6000 features
• Each pale dot color represents a different Remote Code.
• Dots with the same deep color are features from a single identity.

Appendix 4
• Pose estimation error
  • LR: Linear Regression
  • SVR: Support Vector Regression
  • Compared methods (columns): Z.Zhu[2], LR, SVR; rows: Setting 1, Setting 2
  • Values as extracted, cell assignment not recoverable: 5.92 9.79 5.45 5.56 4.29
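Two steps described in the slides, the multi-task cost built from squared L2 norms and the nearest-neighbor face recognition against gallery images, can be sketched as follows. This is only an illustration under stated assumptions: the slides do not show the total cost formula, so the weighted sum and the `aux_weight` parameter are hypothetical, as are all function names and the toy gallery vectors.

```python
from typing import Dict, Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared L2 distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def total_cost(main_out: Sequence[float], main_target: Sequence[float],
               recon_out: Sequence[float], input_image: Sequence[float],
               aux_weight: float = 1.0) -> float:
    """Main-task cost plus weighted auxiliary (reconstruction) cost.

    The weighted-sum form and `aux_weight` are assumptions; the slides
    only name the two tasks and the squared L2 norm.
    """
    main = squared_l2(main_out, main_target)      # generate desired pose
    aux = squared_l2(recon_out, input_image)      # reconstruct the input
    return main + aux_weight * aux


def nearest_identity(query: Sequence[float],
                     gallery: Dict[str, Sequence[float]]) -> str:
    """Return the gallery identity closest to `query` under the L2 norm.

    Minimizing the squared distance selects the same neighbor as
    minimizing the distance itself.
    """
    if not gallery:
        raise ValueError("gallery is empty")
    return min(gallery, key=lambda name: squared_l2(query, gallery[name]))
```

For example, with a toy gallery `{"n001": [1.0, 0.0], "n002": [0.0, 1.0]}`, the query `[0.9, 0.1]` is assigned to `n001`, mirroring the "Result: n001" label in the recognition figure.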