
PASCAL VOC Classification: Local Features vs. Deep Features
Shuicheng YAN, NUS
PASCAL VOC: why valuable? Multi-label, real scenarios!
[Title figure: visual object recognition spans object classification, object detection, and object segmentation; example image labels: person, horse, barrier, table, etc.]

PASCAL VOC: Visual Object Classes challenges

- Held yearly from 2007 to 2012
- Main tasks: object classification, detection and segmentation
- Other tasks: person layout, action recognition, etc.
- Data: 20 object classes, ~23,000 images with fine labeling
- Tens of teams from universities and industry participated, including INRIA, Berkeley, Oxford, NEC, etc.
- Became "the dataset" for visual object recognition research
PASCAL VOC: 2010-2014

NUS-PSL team results

- 2010, 2011, 2012: Winner of the object classification task (cls)
- 2012: Winner of the object segmentation task (seg)
- 2010: Honorable mention in the object detection task (det)
- 2014: Classification mAP reached 0.91
NUS-PSL architecture

A joint learning of cls-det-seg for visual object recognition:
- Cls: global information
- Det: local information
- Seg: fine-detailed information
PASCAL VOC: 2010-2014

[Classification mAP timeline (≈25% overall gain):
LLC, 2010: 73.8% → Context-SVM / GHM, 2011: 78.7% → Sub-category, 2012: 82.2% → deep feature, 2013: 79.0% → deep feature, 2014: 83.2% → HCP, 2014: 91.4%]
I. Spring of Local Features: 2010-2012
Pipeline

Feature representation: low-level features → feature encoding → feature pooling
Model learning: classifier learning → context modeling

- GHM [2]: Generalized Hierarchical Matching for object-central problems, i.e. object-centric pooling (feature pooling stage).
- Subcategory mining [1]: automatically mining visual subcategories based on ambiguity modeling (classifier learning stage).
- Contextualization [3]: mutual contextualization of the object classification and detection tasks, giving a large performance improvement (context modeling stage).

1. Jian Dong, Qiang Chen, Jiashi Feng, Wei Xia, Zhongyang Huang, Shuicheng Yan. Subcategory-aware Object Classification. In CVPR'13.
2. Qiang Chen, Zheng Song, Yang Hua, Zhongyang Huang, Shuicheng Yan. Hierarchical Matching with Side Information for Image Classification. In CVPR'12.
3. Zheng Song*, Qiang Chen*, Zhongyang Huang, Yang Hua, Shuicheng Yan. Contextualizing Object Detection and Classification. In CVPR'11.
Framework – NUS-PSL 2010

[Pipeline diagram, illustrated for the "chair" class:
- Visual features: local feature extraction → feature coding → feature pooling (SPM, max pooling)
- Kernels: nonlinear kernel + linear kernel
- Classification: SVM / kernel regression
- Post-processing: confidence refinement with the exclusive prior, using detection results]
Framework – NUS-PSL 2012

[Pipeline diagram, illustrated for the "chair" class. Relative to the 2010 framework, three components are added:
I. Contextualized object classification and detection (classification stage, using detection results)
II. Hierarchical matching: generalized SPM / GHM pooling alongside SPM and max pooling (feature pooling stage)
III. Subcategory mining with flipping (producing per-subcategory models)
The remaining stages follow 2010: local feature extraction, feature coding (now also with flipping and a flipping kernel, FK), nonlinear + linear kernels, SVM / kernel regression, and confidence refinement with the exclusive prior.]
Outline for VOC: 2010-2012

- Context model: Contextualized Object Classification and Detection
- Feature pooling: Generalized Hierarchical Matching/Pooling
- Subcategory learning: Sub-Category Aware Detection & Classification
Contextualized Object Classification and Detection

- Det: local patches with matched local shape/texture
- Cls: global probabilities to contain objects (occurrence probability)
- Question: can the two tasks exchange information?
Observations

- Object classification and detection are mutually complementary: each subject task serves as a context task for the other.
- Context is not robust for the subject task, so it should be used only when necessary:
  - Scene/global-level information is not stable for object detection.
  - False alarms of object detection harm object classification.

[Example image: person]
Contextualized SVM - Formulation

- Adaptive contextualization: the sample-specific classification hyperplane combines the original classification hyperplane (feature dim n) with an adaptive embedding of the context features (context dim m), applied selectively to ambiguous samples.
- Configurable model complexity: a low-rank constraint reduces the context-embedding parameters from dim n x m to R x (n + m).
- Easy to solve and to kernelize once the ambiguity-based selection is fixed. (A sketch of the resulting decision function follows below.)
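The slide's equation itself is not preserved in these notes; a plausible rendering consistent with the annotations above (notation assumed, not copied from the CVPR'11 paper) is:

```latex
% x_i in R^n : subject-task feature;  c_i in R^m : context feature from the other task
% w in R^n   : original classification hyperplane
% s_i in [0,1] : selection weight, nonzero only for ambiguous samples
f(x_i, c_i) = \bigl(w + s_i\,\Theta\, c_i\bigr)^{\top} x_i + b,
\qquad
\Theta = U V^{\top},\ U \in \mathbb{R}^{n \times R},\ V \in \mathbb{R}^{m \times R}
```

The low-rank factorization of the context embedding is what reduces the extra parameters from n x m to R x (n + m), and with the selection weights fixed the model stays linear in its parameters, so standard SVM solvers and kernelization apply.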
Contextualized SVM - Formulation

- Ambiguity modeling:
  - Define the ambiguity degree of a sample as the hinge loss of the subject task.
  - Learn the Ambiguity-guided Mixture Model (AMM) through EM to maximize the corresponding objective.
  - The multi-mode ambiguity term is defined as the posterior of each mixture component r.
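The equations themselves are likewise missing from these notes; a hedged sketch that follows the verbal definitions above (writing the mixture over the context features is an assumption):

```latex
% Ambiguity degree of sample i = hinge loss of the subject task
a_i = \max\bigl(0,\ 1 - y_i\, f(x_i)\bigr)
% Ambiguity-guided Mixture Model (AMM), fit by EM
\max_{\{\pi_r, \mu_r, \Sigma_r\}} \ \sum_i a_i \log \sum_{r=1}^{R} \pi_r\, \mathcal{N}(c_i \mid \mu_r, \Sigma_r)
% Multi-mode ambiguity term = posterior of mixture component r
s_{i,r} = \frac{\pi_r\, \mathcal{N}(c_i \mid \mu_r, \Sigma_r)}{\sum_{r'} \pi_{r'}\, \mathcal{N}(c_i \mid \mu_{r'}, \Sigma_{r'})}
```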
Iterative Co-training of Detection and Classification

[Diagram of the alternating pipelines:
- Learn to detect: the detection pipeline combines the detection feature with context from the current classification model (initial model, then 1st-iteration classification, ...) in a Context-SVM.
- Learn to classify: the classification pipeline combines the classification feature with context from the current detection model (initial model, then 1st-iteration detection, ...) in a Context-SVM.
Stages: a) initial models, b) 1st iteration of Context-SVM, c) 2nd iteration of Context-SVM, ...]
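A minimal Python sketch of this alternating scheme, assuming for simplicity that both tasks are scored over the same set of images and approximating the Context-SVM by an SVM on the concatenation of features and context scores (the real model uses the adaptive low-rank embedding above):

```python
import numpy as np
from sklearn.svm import LinearSVC

def co_train(det_X, det_y, cls_X, cls_y, n_iters=2):
    """Alternate between detection and classification, each using the other's scores as context."""
    det_in, cls_in = det_X, cls_X
    det_model = LinearSVC().fit(det_in, det_y)          # a) initial detection model
    cls_model = LinearSVC().fit(cls_in, cls_y)          # a) initial classification model
    for _ in range(n_iters):                             # b), c) iterations of the (approximate) Context-SVM
        det_ctx = det_model.decision_function(det_in)    # detection scores -> context for classification
        cls_ctx = cls_model.decision_function(cls_in)    # classification scores -> context for detection
        det_in = np.column_stack([det_X, cls_ctx])       # detection feature + classification context
        cls_in = np.column_stack([cls_X, det_ctx])       # classification feature + detection context
        det_model = LinearSVC().fit(det_in, det_y)
        cls_model = LinearSVC().fit(cls_in, cls_y)
    return det_model, cls_model
```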
Results

- Iterative contextualization: mean AP values over the 20 classes on VOC 2010 train/val [table]
Results

- Comparison with the state of the art on VOC 2010 [table]
Exemplar results

Representative examples of the baseline (without contextualization) and Context-SVM for the classification task.
Outline for VOC: 2010-2012

- Context model: Contextualized Object Classification and Detection
- Feature pooling: Generalized Hierarchical Matching/Pooling
- Subcategory learning: Sub-Category Aware Detection & Classification
Generalized Hierarchical Matching/Pooling

- Traditional pooling: SPM is an approximate geometric constraint
- Not optimal for object recognition due to misalignment

[Figure: (a) images, (b) SPM partitions, (c) object confidence map partitions]
Hierarchical Pooling for Image Classification

- Design a general form of hierarchical matching with side information
- Represent the image with a hierarchical structure
Hierarchical Matching Kernel

- The image similarity kernel is defined as the weighted sum over the per-cluster kernels (see the sketch below).
- A general form of SPM, PMK, etc.
- Flexible to integrate other side information.
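In symbols, a hedged rendering of that weighted sum (the slide's own equation is not preserved; the notation here is assumed):

```latex
% C        : clusters of the hierarchical structure built from side information
% X_c, Y_c : local features of images X and Y falling in cluster c
% w_c >= 0 : weight of cluster c (learned or fixed)
K(X, Y) = \sum_{c \in C} w_c\, k\bigl(X_c,\, Y_c\bigr)
```

Choosing the hierarchy as a spatial pyramid and k as a histogram-matching kernel recovers SPM/PMK as special cases.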
Generalized Hierarchical Matching/Pooling

[Figure: (a) side information and image; (b) hierarchical clustering by side information, levels 1 (top), 2 (mid), 3 (bottom); (c) hierarchical structure representation; (d) matching/pooling within each cluster; encoded local features vs. side information]

Utilize side information to hierarchically pool local features.
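A minimal NumPy sketch of pooling encoded local features within side-information clusters; the array layout and the choice of max pooling within each cluster are assumptions for illustration:

```python
import numpy as np

def ghm_pool(codes, cluster_ids, n_clusters):
    """Pool encoded local features within each side-information cluster.

    codes       : (N, D) array of encoded local descriptors for one image
    cluster_ids : (N,) cluster index of each descriptor at one hierarchy level
    returns     : (n_clusters * D,) concatenated per-cluster max-pooled vector
    """
    cluster_ids = np.asarray(cluster_ids)
    pooled = np.zeros((n_clusters, codes.shape[1]))
    for c in range(n_clusters):
        members = codes[cluster_ids == c]
        if len(members):
            pooled[c] = members.max(axis=0)   # max pooling within the cluster
    return pooled.ravel()

# The full GHM representation concatenates this vector over all hierarchy levels.
```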
Side information design

Side Information - Detection Confidence Map

[Pipeline: images → sliding-window process applying a shape model and an appearance model to each sub-window → scores voted back to the image → fused object confidence maps]
Results

[Table: GHM results on PASCAL VOC]
Outline for VOC: 2010-2012

- Context model: Contextualized Object Classification and Detection
- Feature pooling: Generalized Hierarchical Matching/Pooling
- Subcategory learning: Sub-Category Aware Detection & Classification
Sub-Category Mining

[Example images of ambiguous classes: chair, sofa, diningtable]
Ambiguity Guided Subcategory Mining

Subcategory-aware object classification:
[Diagram: subcategory models 1, 2, ..., N feeding a fusion model]
Sub-Category Mining

[Diagram: ambiguous categories (e.g. chair vs. sofa) → instance affinity graph built from intra-class similarity and inter-class ambiguity → dense subgraphs detected by graph shift → corresponding subcategories (visualization)]

Subcategory mining based on both similarity and ambiguity (a sketch of the graph construction follows below):

- Calculate the sample intra-class similarity
- Calculate the sample inter-class ambiguity
- Detect dense subgraphs with the graph shift algorithm [1]
- Map subgraphs to subcategories

[1] Hairong Liu, Shuicheng Yan. Robust Graph Mode Seeking by Graph Shift. ICML 2010.
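A hedged NumPy sketch of how the instance affinity graph might combine the two terms; the Gaussian kernels and the particular combination rule are assumptions, and the dense-subgraph step itself is left to the cited graph-shift algorithm:

```python
import numpy as np

def build_affinity(feats, labels, positive_class, sigma=1.0):
    """Affinity among positive samples, weighted by intra-class similarity
    and inter-class ambiguity (how close a sample sits to the other classes)."""
    pos = feats[labels == positive_class]
    neg = feats[labels != positive_class]

    # Intra-class similarity: Gaussian kernel between positive samples.
    d2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)
    similarity = np.exp(-d2 / (2 * sigma ** 2))

    # Inter-class ambiguity: proximity of each positive sample to its nearest negative.
    d2_neg = ((pos[:, None, :] - neg[None, :, :]) ** 2).sum(-1)
    ambiguity = np.exp(-d2_neg.min(axis=1) / (2 * sigma ** 2))   # (N_pos,)

    # Edge weight = similarity modulated by both endpoints' ambiguity.
    W = similarity * np.outer(ambiguity, ambiguity)
    np.fill_diagonal(W, 0.0)
    return W   # dense subgraphs of W are then found by graph shift [1]
```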
Sub-Category Aware Detection & Classification

[Pipeline for a testing image:
- Feature extraction: sliding/selective window search; local feature extraction and coding; GHM pooling → image representation
- Subcategory models 1, ..., N: each holds a detection model and a classification model, producing subcategory detection result i and subcategory classification result i
- Fusion model: combines the per-subcategory results into the category-level result]
Sub-Category Mining Result

[Example mined subcategories for bus and chair, together with detected outliers]
Summary of VOC results (classification AP, %; "Ours" = NUS-PSL best, "Others" = best of other teams)

Class       | 2010 Ours | 2010 Others | 2011 Ours | 2011 Others | 2012 Ours | 2012 Others
aeroplane   | 93.0 | 93.3 | 95.5 | 94.5 | 97.3 | 92.0
bicycle     | 79.0 | 77.0 | 81.1 | 82.6 | 84.2 | 74.2
bird        | 71.6 | 69.9 | 79.4 | 79.4 | 80.8 | 73.0
boat        | 77.8 | 77.2 | 82.5 | 80.7 | 85.3 | 77.5
bottle      | 54.3 | 53.7 | 58.2 | 57.8 | 60.8 | 54.3
bus         | 85.2 | 85.9 | 87.7 | 87.8 | 89.9 | 85.2
car         | 78.6 | 80.4 | 84.1 | 85.5 | 86.8 | 81.9
cat         | 78.8 | 79.4 | 83.1 | 83.9 | 89.3 | 76.4
chair       | 64.5 | 62.9 | 68.5 | 66.6 | 75.4 | 65.2
cow         | 64.0 | 66.2 | 74.7 | 74.2 | 77.8 | 63.2
diningtable | 62.9 | 61.1 | 68.5 | 69.4 | 75.1 | 68.5
dog         | 69.6 | 71.1 | 76.4 | 75.2 | 83.0 | 68.9
horse       | 82.0 | 76.7 | 83.3 | 83.0 | 87.5 | 78.2
motorbike   | 84.4 | 81.7 | 87.5 | 88.1 | 90.1 | 81.0
person      | 91.6 | 90.2 | 92.8 | 93.5 | 95.0 | 91.6
pottedplant | 48.6 | 53.3 | 56.5 | 58.7 | 57.8 | 55.9
sheep       | 65.4 | 66.3 | 77.7 | 75.5 | 79.2 | 69.4
sofa        | 59.6 | 58.0 | 67.0 | 66.3 | 73.4 | 65.4
train       | 89.4 | 87.5 | 91.2 | 90.0 | 94.5 | 86.7
tvmonitor   | 77.2 | 76.2 | 77.5 | 77.2 | 80.7 | 77.4
MAP         | 73.8 | -    | 78.7 | -    | 82.2 | -
II. Spring of Deep Features: 2013-2014
CNN: Single-label Image Classification

- Definition:
  - Assign one and only one label from a pre-defined set to an image
  - Explicit assumption: the object is roughly aligned
- AlexNet [1] made a great breakthrough in single-label classification in ILSVRC 2012 (with a 10% gain over previous methods)

[1] A. Krizhevsky, I. Sutskever, G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.
CNN: Multi-label Image Classification

- Definition:
  - Assign multiple labels from a pre-defined set to an image (vs. single-label images)
- Challenges:
  - Foreground objects are not roughly aligned
  - Interactions between different objects, e.g. partial visibility and occlusion
  - A large number of training images are required
  - The label space expands from n to 2^n

Directly training a CNN on multi-label images is unreasonable and unreliable!
Hypotheses-CNN-Pooling (HCP)

Our framework:
[Diagram: an input image (ground-truth labels: dog, person, sheep) is decomposed into hypotheses, each assumed to be single-labeled. Every hypothesis is fed through a shared convolutional neural network (AlexNet-style: conv feature maps of size 55/27/13 with 96/256/384/384/256 channels and max pooling, two 4096-d fully-connected layers, and a c-way output), giving scores for each individual hypothesis. Cross-hypothesis max pooling then produces the image-level multi-label prediction; a sketch follows below.]
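A hedged PyTorch-style sketch of that forward pass; the backbone, tensor shapes, and names below are stand-ins, not the authors' implementation:

```python
import torch
import torch.nn as nn

class HCP(nn.Module):
    """Hypotheses-CNN-Pooling: shared CNN over hypotheses + cross-hypothesis max pooling."""
    def __init__(self, shared_cnn: nn.Module, num_classes: int, feat_dim: int = 4096):
        super().__init__()
        self.shared_cnn = shared_cnn              # e.g. an AlexNet-style feature extractor
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, hypotheses):                # hypotheses: (B, H, 3, 227, 227)
        b, h = hypotheses.shape[:2]
        x = hypotheses.flatten(0, 1)              # treat every hypothesis as one input
        scores = self.classifier(self.shared_cnn(x))   # (B*H, num_classes)
        scores = scores.view(b, h, -1)
        return scores.max(dim=1).values           # cross-hypothesis max pooling -> (B, num_classes)
```

Because only the maximum survives the pooling, low-scoring noisy hypotheses do not affect the image-level prediction, which is the robustness property listed on the next slide.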
Characteristics of Our Framework

- No ground-truth bounding box information is required for training on the multi-label image dataset
- The proposed HCP infrastructure is robust to noisy and/or redundant hypotheses
- No explicit hypothesis label is required for training
- The shared CNN can be well pre-trained on a large-scale single-label image dataset
- The HCP outputs are naturally multi-label prediction results
Training of HCP

- Hypotheses extraction
- Initialization of HCP
  - Pre-training on a large-scale single-label image set, e.g. ImageNet
  - Image-fine-tuning on a multi-label image set
- Hypotheses-fine-tuning
Hypotheses Extraction

- Criteria:
  - High object detection recall rate
  - Small number of hypotheses
  - High computational efficiency
- Solution: BING [2] + box clustering (a sketch follows below)

[2] M.-M. Cheng, J. Warrell, W.-Y. Lin, and P. H. S. Torr. BING: Binarized Normed Gradients for Objectness Estimation at 300fps. CVPR 2014.
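A rough sketch of one way to reduce a large set of BING proposals to a small hypothesis set, here via greedy IoU grouping; the grouping rule, thresholds, and names are assumptions rather than the paper's exact clustering:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter + 1e-9)

def cluster_hypotheses(boxes, scores, iou_thr=0.7, max_hyps=500):
    """Greedy grouping of objectness proposals: keep the best-scoring box of each group."""
    order = np.argsort(-np.asarray(scores))
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thr for j in kept):
            kept.append(i)
        if len(kept) >= max_hyps:
            break
    return [boxes[i] for i in kept]
```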
Hypotheses Extraction

[Figure: hypotheses extraction example]
Initialization of HCP

- Step 1: pre-training on single-label images (e.g. ImageNet)
- Parameter transfer
- Step 2: image-fine-tuning on multi-label images (e.g. PASCAL VOC)
- Followed by hypotheses-fine-tuning
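A hedged PyTorch-style illustration of the parameter-transfer and head-replacement step; torchvision's AlexNet is only a stand-in for the shared CNN, and the loss used for fine-tuning is not specified here:

```python
import torch.nn as nn
from torchvision import models

def init_for_finetuning(num_voc_classes: int = 20) -> nn.Module:
    # Step 1: backbone pre-trained on single-label ImageNet.
    net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
    # Step 2: replace the 1000-way output with a multi-label head for image-fine-tuning;
    # training would then use a multi-label loss (e.g. per-class sigmoid).
    net.classifier[-1] = nn.Linear(net.classifier[-1].in_features, num_voc_classes)
    return net
```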
Experimental Results

- A subset of the ILSVRC 2013 detection dataset is used for BING training
Experimental Results

- Performance on PASCAL VOC 2007 [table; the new entry highlighted]
Experimental Results

- Performance on PASCAL VOC 2012 [table; two new entries highlighted]
Experimental Results

- Complementary analysis: hand-crafted features vs. deep features
Experimental Results

- One test sample from VOC 2007
- ~500 hypotheses per image, 1-1.5 s per image

[Walkthrough: generate hypotheses → feed each into the shared CNN (per-hypothesis predictions: person, horse, car, person, ...) → cross-hypothesis max-pooling → image-level labels: person, horse, car]
New Result: "Network in Network" (NIN)

- NIN: a CNN with non-linear filters (cf. Maxout [4]), yet without the final fully-connected layers
- Intuitively less overfitting globally, and more discriminative locally, with fewer parameters
- (Not used in our final submission due to the surgery of our main team member, but very effective)

[4] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron C. Courville, Yoshua Bengio. Maxout Networks. ICML 2013: 1319-1327.
NIN: Better Local Abstraction

- Motivation: better local abstraction. Each local patch is projected to its feature vector using a small network rather than a single linear filter.
- Cascaded Cross-Channel Parametric Pooling (CCCP)
- CCCP ≈ cascaded 1x1 convolutions in implementation (see the sketch below)

Lin, Min, Qiang Chen, and Shuicheng Yan. "Network In Network." ICLR 2014.
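A minimal PyTorch-style sketch of one NIN block, with CCCP realized as cascaded 1x1 convolutions after an ordinary convolution; the layer sizes are illustrative, not the paper's exact configuration:

```python
import torch.nn as nn

def nin_block(in_ch, out_ch, kernel_size, stride, padding):
    """One NIN block: a conv layer followed by two CCCP layers (1x1 convolutions)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=1), nn.ReLU(inplace=True),   # cccp
        nn.Conv2d(out_ch, out_ch, kernel_size=1), nn.ReLU(inplace=True),   # cccp
    )
```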
Global Average Pooling

- CNN: fully-connected layers on top of the convolutional features
- NIN: one confidence map per category, reduced by global average pooling
- Saves tons of parameters
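And a hedged sketch of the global-average-pooling head: the last layer emits one confidence map per category, which is averaged into class scores without any fully-connected layer (sizes again illustrative):

```python
import torch.nn as nn

class GAPHead(nn.Module):
    """Per-category confidence maps -> class scores, with no fully-connected layer."""
    def __init__(self, in_ch: int, num_classes: int):
        super().__init__()
        self.score_maps = nn.Conv2d(in_ch, num_classes, kernel_size=1)  # one map per class

    def forward(self, x):                      # x: (B, in_ch, H, W)
        maps = self.score_maps(x)              # (B, num_classes, H, W)
        return maps.mean(dim=(2, 3))           # global average pooling -> (B, num_classes)
```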
NIN in ILSVRC 2014

To avoid hyper-parameter tuning, we put CCCP layers directly on the convolution layers of ZFNet ("Network in ZFNet").

Network in ZFNet:
Layer | Details
Conv1 | stride = 2, kernel = 7x7, channel_out = 96
Cccp1 | output = 96
Conv2 | stride = 2, kernel = 5x5, channel_out = 256
Cccp2 | output = 256
Conv3 | stride = 1, kernel = 3x3, channel_out = 512
Cccp3 | output = 256
Conv4 | stride = 1, kernel = 3x3, channel_out = 1024
Cccp4 | output = 512
Cccp5 | output = 384
Conv5 | stride = 1, kernel = 3x3, channel_out = 512
Cccp6 | output = 256
Fc1   | output = 4096
Fc2   | output = 4096
Fc3   | output = 1000

Reference ZFNet:
Layer | Details
Conv1 | stride = 2, kernel = 7x7, channel_out = 96
Conv2 | stride = 2, kernel = 5x5, channel_out = 256
Conv3 | stride = 1, kernel = 3x3, channel_out = 512
Conv4 | stride = 1, kernel = 3x3, channel_out = 1024
Conv5 | stride = 1, kernel = 3x3, channel_out = 512
Fc1   | output = 4096
Fc2   | output = 4096
Fc3   | output = 1000

(10.91%) with 256xN training and 3-view test

Zeiler, Matthew D., and Rob Fergus. "Visualizing and Understanding Convolutional Networks." ECCV 2014: 818-833.
NIN in HCP

[Diagram: the shared CNN in HCP is replaced by a shared NIN; per-hypothesis scores over the c classes are still combined by cross-hypothesis max pooling (example image labels: dog, person, sheep)]
Compared with the State of the Art on VOC 2012 (classification AP, %)

Category | NUS-PSL [1] | PRE-1000C [2] | PRE-1512 [2] | Chatfield et al. [3] | HCP-NIN | HCP-NIN+NUS-PSL
plane    | 97.3 | 93.5 | 94.6 | 96.8 | 98.4 | 99.5
bicycle  | 84.2 | 78.4 | 82.9 | 82.5 | 89.5 | 93.7
bird     | 80.8 | 87.7 | 88.2 | 91.5 | 96.2 | 96.8
boat     | 85.3 | 80.9 | 84.1 | 88.1 | 91.7 | 94.0
bottle   | 60.8 | 57.3 | 60.3 | 62.1 | 72.5 | 77.7
bus      | 89.9 | 85.0 | 89.0 | 88.3 | 91.1 | 95.3
car      | 86.8 | 81.6 | 84.4 | 81.9 | 87.2 | 92.4
cat      | 89.3 | 89.4 | 90.7 | 94.8 | 97.1 | 98.2
chair    | 75.4 | 66.9 | 72.1 | 70.3 | 73.0 | 86.1
cow      | 77.8 | 73.8 | 86.8 | 80.2 | 89.5 | 91.3
table    | 75.1 | 62.0 | 69.0 | 76.2 | 75.1 | 83.5
dog      | 83.0 | 89.5 | 92.1 | 92.9 | 96.3 | 97.3
horse    | 87.5 | 83.2 | 93.4 | 90.3 | 93.0 | 96.8
motor    | 90.1 | 87.6 | 88.6 | 89.3 | 90.5 | 96.3
person   | 95.0 | 95.8 | 96.1 | 95.2 | 94.8 | 95.8
plant    | 57.8 | 61.4 | 64.3 | 57.4 | 66.5 | 72.2
sheep    | 79.2 | 79.0 | 86.6 | 83.6 | 90.3 | 91.5
sofa     | 73.4 | 54.3 | 62.3 | 66.4 | 65.8 | 81.1
train    | 94.5 | 88.0 | 91.1 | 93.5 | 95.6 | 97.6
tv       | 80.7 | 78.3 | 79.8 | 81.9 | 82.0 | 90.0
MAP      | 82.2 | 78.7 | 82.8 | 83.2 | 86.8 | 91.4

[1] S. Yan, J. Dong, Q. Chen, Z. Song, Y. Pan, W. Xia, Z. Huang, Y. Hua, and S. Shen. Generalized Hierarchical Matching for Sub-category Aware Object Classification. In Visual Recognition Challenge workshop, ECCV, 2012.
[2] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks. CVPR, 2014.
[3] K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman. Return of the Devil in the Details: Delving Deep into Convolutional Nets. BMVC, 2014.
Demo
Online Demo
Highest and Lowest Scored Five Images for Each Class

[Example slides showing the five highest- and five lowest-scoring test images for each of the 20 classes: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, TV monitor]
What's next?

- Better solution for small/occluded objects?
- More extra data?
- Better local features?
- Better deep features?

[Timeline repeated: 2009: 66.5% → LLC, 2010: 73.8% → Context-SVM / GHM, 2011: 78.7% → Sub-category, 2012: 82.2% → deep feature, 2014: 83.2% → HCP, 2014: 91.4%; ≈25% overall gain]
Shuicheng YAN
eleyans@nus.edu.sg