S EEING W HAT Y OU ’ RE T OLD : S ENTENCE -G UIDED A CTIVITY R ECOGNITION I N V IDEO N. Siddharth, Andrei Barbu, & Jeffrey Mark Siskind Stanford University, MIT, Purdue University How? What? en Giv jL = 1 jL = 2 jL = 3 The person to the left of the chair put down the blue object. ize n g co Re t=1 t=2 t=3 b11 b21 b31 ··· bT1 kW = 1 b12 b22 b32 ··· bT2 kW = 2 b23 .. . .. . .. . b1J 1 b2J 2 b3J 3 f Object Detectors 3 t f (bjlt ) + T X kW = 3 t=3 1, b1j 1 , . . . , b1j 1 1, b2j 2 , . . . , b2j 2 1, b3j 3 , . . . , b3j 3 θ1 W t=2 Joint L X T X j1 ,...,j1 k1 ,...,k1 T l=1 t=1 1 ,...,kW jL1 ,...,jLT kW f (btjlt ) + T X θ1 W Is W W θ1 W θ θ1 W 3, b1j 1 , . . . , b1j 1 θ1 W kW = KsW K 1 sW , bj 1 θ1 W T X t g(bt−1 , b t) + j t−1 jl t=2 θ t=T Is W W 2, b3j 3 , . . . , b3j 3 θ1 W Is θ W W θ W X T X θ1 W Is W W θ Is W W 3, b3j 3 , . . . , b3j 3 Is θ W W θ1 W θ .. . , . . . , b1j 1 θ1 W Is W W ··· 1, bTjT , . . . , bTjT ··· 2, bTjT , . . . , bTjT ··· 3, bTjT , . . . , bTjT θ KsW , b3j 3 , . . . , b3j 3 θ1 W Is W W Is θ W W ··· • Joint evaluation of both Tracking and Activity Recognition – A Tracker for each participant in the activity – A Word Model for each lexical entry in the lexicon agent-tracker hw(kwt , btj t1 , btj t2 ) + θw θw hw(kwt , btj t1 , btj t2 ) + θw T X θw T X θ Sentential Focus of Attention Is W W Same video | Different sentences → Different tracks KsW , bTjT , . . . , bTjT θ1 W θ Is W W aw(kwt−1, kwt ) The person carried an object towards the trash can. t=2 The person carried an object away from the trash can. Semantic Video Retrieval Many videos | A sentence → That video goal-tracker ... object 0 object 1 object 2 2 The person carried the backpack away from the trash can. Generation of Sentential Descriptions 0 3 1 Many sentences† | A video → That sentence † S NP D A N PP P VP V ADV PPM PM • Sensitive to sentence structure The person approached the object. ... object 3 • Recognize different parts of speech Adverbs Adjectives Prepositions θ1 W Is W W t=2 patient-tracker • Integration of top-down sentential information and bottom-up tracker information Verbs Nouns Determiners θ aw(kwt−1, kwt ) W words referent-tracker θ1 W Is W W Performing disparate tasks simply by leveraging the framework differently as W The person to the left of the stool carried the traffic-cone towards the trash-can. Encodes word meanings over participant features θ .. . Word Models Sentence θ1 W .. . KsW , b2j 2 , . . . , b2j 2 Is θ W W w=1 t=1 L tracks M d r o AW 3, b2j 2 , . . . , b2j 2 t=1 l θ1 W Is θ W W 2, b2j 2 , . . . , b2j 2 Is W W .. . bTJ T ··· θ 2, b1j 1 , . . . , b1j 1 .. . t−1 t g(bj t−1 , bjlt ) l Simultaneous Tracking and Activity Recognition Lexicon t=2 hsW max max 1 T 1 T Trackers bT3 ··· t=1 odel g t=1 Encodes participant features t=T r e k c a b3 A Tr b13 T X Characteristics Video jL = J t Look! The object approached 6≡ the person. → → → → → → → → → → → → NP VP D [A] N [PP] an | the blue | red person | backpack | trash-can | chair | object P NP to the left of | to the right of V NP [ADV] [PPM] picked up | put down | carried | approached quickly | slowly PM NP towards | away from The person to the right of the trash-can picked up the backpack. This research was supported, in part, by ARL, under Cooperative Agreement Number W911NF-10-2-0060, and the Center for Brains, Minds and Machines, funded by NSF STC award CCF-1231216. The views and conclusions contained in this document are those of the authors and do not represent the official policies, either express or implied, of ARL or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes, notwithstanding any copyright notation herein. S : (B, s, Λ) → (τ, J) B – detections, s – sentence, Λ – lexicon τ – score, J – tracks http://upplysingaoflun.ecn.purdue.edu/~qobi/cccp/ siddharth@iffsid.com andrei@0xab.com qobi@purdue.edu
© Copyright 2025