Seeing What You're Told: Sentence-Guided Activity Recognition in Video
N. Siddharth, Andrei Barbu, & Jeffrey Mark Siskind
Stanford University, MIT, Purdue University
How?
What?
Given
The person to the left of the chair put down the blue object.
Recognize
[Figure: detection, tracker, and word-model lattices over frames $t = 1, \ldots, T$. In frame $t$ the object detectors propose candidate detections $b^t_1, \ldots, b^t_{J^t}$, each scored by $f$. Tracker $l$ selects a detection index $j_l^t \in \{1, \ldots, J^t\}$ in each frame; word model $w$ selects a state $k_w^t \in \{1, \ldots, K_{s_w}\}$ in each frame. A node of the joint lattice is $(k_w^t, b^t_{j^t_{\theta_w^1}}, \ldots, b^t_{j^t_{\theta_w^{I_{s_w}}}})$.]

Joint
$$
\max_{\substack{j_1^1,\ldots,j_1^T \\ \vdots \\ j_L^1,\ldots,j_L^T}}\ \max_{\substack{k_1^1,\ldots,k_1^T \\ \vdots \\ k_W^1,\ldots,k_W^T}}
\left[\sum_{l=1}^{L}\left(\sum_{t=1}^{T} f\!\left(b^t_{j_l^t}\right) + \sum_{t=2}^{T} g\!\left(b^{t-1}_{j_l^{t-1}}, b^t_{j_l^t}\right)\right)\right]
+ \left[\sum_{w=1}^{W}\left(\sum_{t=1}^{T} h_{s_w}\!\left(k_w^t, b^t_{j^t_{\theta_w^1}}, \ldots, b^t_{j^t_{\theta_w^{I_{s_w}}}}\right) + \sum_{t=2}^{T} a_{s_w}\!\left(k_w^{t-1}, k_w^t\right)\right)\right]
$$
• Joint evaluation of both Tracking and Activity Recognition
– A Tracker for each participant in the activity
– A Word Model for each lexical entry in the lexicon
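The tracker half of the poster's objective, a sum of per-detection scores f plus temporal-coherence scores g maximised over one detection per frame, has the shape of a Viterbi-style dynamic program. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Box` layout and the concrete choices of `detection_score` (standing in for f) and `coherence_score` (standing in for g) are hypothetical, picked only to make the example runnable.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical detection layout: (x, y, width, height, detector_confidence).
Box = Tuple[float, float, float, float, float]


def detection_score(b: Box) -> float:
    """Stand-in for f(b): here simply the detector confidence (an assumption)."""
    return b[4]


def coherence_score(prev: Box, cur: Box) -> float:
    """Stand-in for g(b', b): negative distance between box origins (an assumption)."""
    (px, py), (cx, cy) = prev[:2], cur[:2]
    return -(((px - cx) ** 2 + (py - cy) ** 2) ** 0.5)


def best_track(
    frames: Sequence[Sequence[Box]],
    f: Callable[[Box], float] = detection_score,
    g: Callable[[Box, Box], float] = coherence_score,
) -> Tuple[float, List[int]]:
    """Maximise sum_t f(b^t_j) + sum_{t>=2} g(b^{t-1}_j', b^t_j) over one index per frame."""
    # score[j]: best total of any partial track ending at detection j of the current frame.
    score = [f(b) for b in frames[0]]
    back: List[List[int]] = []
    for t in range(1, len(frames)):
        new_score, pointers = [], []
        for b in frames[t]:
            cands = [score[i] + g(p, b) for i, p in enumerate(frames[t - 1])]
            i_best = max(range(len(cands)), key=cands.__getitem__)
            new_score.append(cands[i_best] + f(b))
            pointers.append(i_best)
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final detection.
    j = max(range(len(score)), key=score.__getitem__)
    total, track = score[j], [j]
    for pointers in reversed(back):
        j = pointers[j]
        track.append(j)
    track.reverse()
    return total, track


frames = [
    [(0, 0, 1, 1, 1.0), (10, 0, 1, 1, 0.9)],
    [(0, 0, 1, 1, 0.2), (10, 0, 1, 1, 0.8)],
    [(1, 0, 1, 1, 0.5), (10, 0, 1, 1, 0.7)],
]
total, track = best_track(frames)
print(track)  # [1, 1, 1]: the steadier second detection wins in every frame
```

In this sketch the first frame's strongest detection is not chosen, because the coherence term rewards the track that stays put. The joint objective extends the same lattice with word-model states, so that the sentence can influence which detections are selected.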
agent-tracker
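[Figure: word-model score for a two-participant word $w$: $\sum_{t=1}^{T} h_w\!\left(k_w^t, b^t_{j^t_{\theta_w^1}}, b^t_{j^t_{\theta_w^2}}\right) + \sum_{t=2}^{T} a_w\!\left(k_w^{t-1}, k_w^t\right)$]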
Sentential Focus of Attention
Same video | Different sentences → Different tracks
The person carried an object towards the trash can.
The person carried an object away from the trash can.
Semantic Video Retrieval
Many videos | A sentence → That video
goal-tracker
[Figure: candidate detections labelled object 0, object 1, object 2, object 3]
The person carried the backpack away from the trash can.
Generation of Sentential Descriptions
Many sentences† | A video → That sentence
† Sentences generated by a grammar with nonterminals S, NP, D, A, N, PP, P, VP, V, ADV, PPM, and PM.
• Sensitive to sentence structure
The person approached
the object.
• Recognize different parts of speech
Adverbs
Adjectives
Prepositions
patient-tracker
• Integration of top-down sentential information and
bottom-up tracker information
Verbs
Nouns
Determiners
W words
referent-tracker
Performing disparate tasks simply by leveraging the
framework differently
The person to the left of the stool carried the traffic-cone towards the trash-can.
Encodes word meanings
over participant features
Word Models
Sentence
L tracks
Simultaneous Tracking and
Activity Recognition
Lexicon
Trackers
Encodes participant features
Characteristics
Video
Look!
The person approached the object. ≢ The object approached the person.
S → NP VP
NP → D [A] N [PP]
D → an | the
A → blue | red
N → person | backpack | trash-can | chair | object
PP → P NP
P → to the left of | to the right of
VP → V NP [ADV] [PPM]
V → picked up | put down | carried | approached
ADV → quickly | slowly
PPM → PM NP
PM → towards | away from
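To illustrate how the poster's grammar yields a pool of candidate sentences, the sketch below encodes the twelve productions as a Python dictionary and enumerates phrases lazily. It is an illustration only: the dictionary encoding, the names `expand`, `expand_sequence`, and `phrases`, and the `pp_budget` cap on nested PPs (needed because NP → ... [PP] and PP → P NP are mutually recursive) are assumptions, and capitalisation and the final period are omitted.

```python
from itertools import islice
from typing import Iterator, List

# Each alternative is a list of symbols; a symbol wrapped in its own list is optional.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["D", ["A"], "N", ["PP"]]],
    "D":   [["an"], ["the"]],
    "A":   [["blue"], ["red"]],
    "N":   [["person"], ["backpack"], ["trash-can"], ["chair"], ["object"]],
    "PP":  [["P", "NP"]],
    "P":   [["to the left of"], ["to the right of"]],
    "VP":  [["V", "NP", ["ADV"], ["PPM"]]],
    "V":   [["picked up"], ["put down"], ["carried"], ["approached"]],
    "ADV": [["quickly"], ["slowly"]],
    "PPM": [["PM", "NP"]],
    "PM":  [["towards"], ["away from"]],
}


def expand(symbol: str, pp_budget: int) -> Iterator[List[str]]:
    """Yield every word sequence derivable from `symbol` with at most `pp_budget` nested PPs."""
    if symbol not in GRAMMAR:  # terminal (possibly a multi-word one)
        yield [symbol]
        return
    if symbol == "PP":
        if pp_budget == 0:  # no PP allowed at this depth
            return
        pp_budget -= 1
    for rhs in GRAMMAR[symbol]:
        yield from expand_sequence(rhs, pp_budget)


def expand_sequence(rhs: list, pp_budget: int) -> Iterator[List[str]]:
    """Yield every expansion of a right-hand side, honouring optional symbols."""
    if not rhs:
        yield []
        return
    head, rest = rhs[0], rhs[1:]
    if isinstance(head, list):  # optional symbol: first the variants that omit it
        yield from expand_sequence(rest, pp_budget)
        head = head[0]
    for first in expand(head, pp_budget):
        for tail in expand_sequence(rest, pp_budget):
            yield first + tail


def phrases(symbol: str, pp_budget: int = 0) -> Iterator[str]:
    return (" ".join(words) for words in expand(symbol, pp_budget))


# Print a handful of sentences without materialising the whole language.
for sentence in islice(phrases("S", pp_budget=0), 3):
    print(sentence)
```

Enumeration is lazy because even with no prepositional phrases the sentence set is large; a description generator would presumably search this space guided by the score rather than list it exhaustively.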
The person to the right of the trash-can picked up the backpack.
This research was supported, in part, by ARL, under Cooperative Agreement Number W911NF-10-2-0060, and the
Center for Brains, Minds and Machines, funded by NSF STC award CCF-1231216. The views and conclusions contained
in this document are those of the authors and do not represent the official policies, either express or implied, of ARL
or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government
purposes, notwithstanding any copyright notation herein.
S : (B, s, Λ) → (τ, J)
B – detections, s – sentence, Λ – lexicon
τ – score, J – tracks
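The three applications on this poster can be read as different ways of calling the same function S : (B, s, Λ) → (τ, J). The sketch below shows that reading under stated assumptions: the type aliases, the function names `focus_of_attention`, `retrieve`, and `describe`, and in particular `toy_S` are hypothetical. `toy_S` is a trivial stand-in that only counts matching words; it is not the poster's scoring function.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Detections = Any               # B: detections for one video (representation assumed)
Lexicon = Dict[str, Any]       # Λ: word -> word model (representation assumed)
Tracks = List[List[int]]       # J: one detection index per frame for each track
Scorer = Callable[[Detections, str, Lexicon], Tuple[float, Tracks]]


def focus_of_attention(S: Scorer, video: Detections, sentences: Sequence[str],
                       lexicon: Lexicon) -> Dict[str, Tracks]:
    """Same video, different sentences: return the tracks J chosen for each sentence."""
    return {s: S(video, s, lexicon)[1] for s in sentences}


def retrieve(S: Scorer, videos: Dict[str, Detections], sentence: str,
             lexicon: Lexicon) -> str:
    """Many videos, one sentence: return the name of the video with the highest score τ."""
    return max(videos, key=lambda name: S(videos[name], sentence, lexicon)[0])


def describe(S: Scorer, video: Detections, sentences: Sequence[str],
             lexicon: Lexicon) -> str:
    """Many sentences, one video: return the sentence with the highest score τ."""
    return max(sentences, key=lambda s: S(video, s, lexicon)[0])


def toy_S(video: Dict[str, List[int]], sentence: str,
          lexicon: Dict[str, float]) -> Tuple[float, Tracks]:
    """Hypothetical stand-in scorer: a 'video' maps each depicted word to a ready-made track."""
    words = [w for w in sentence.lower().rstrip(".").split() if w in video]
    tau = sum(lexicon.get(w, 0.0) for w in words)
    return tau, [video[w] for w in words]


lexicon = {"person": 1.0, "backpack": 1.0, "chair": 1.0}
videos = {
    "v1": {"person": [0, 0, 1], "backpack": [2, 2, 2]},
    "v2": {"person": [1, 1, 1], "chair": [0, 0, 0]},
}
print(retrieve(toy_S, videos, "The person carried the backpack.", lexicon))
print(describe(toy_S, videos["v2"],
               ["The person carried the backpack.", "The person approached the chair."],
               lexicon))
print(focus_of_attention(toy_S, videos["v1"], ["the person", "the backpack"], lexicon))
```

Passing the scorer in as a parameter keeps the three task functions independent of how τ and J are actually computed, which mirrors the poster's point that the tasks differ only in how the framework is used.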
http://upplysingaoflun.ecn.purdue.edu/~qobi/cccp/
siddharth@iffsid.com
andrei@0xab.com
qobi@purdue.edu