Anticipating Where People will Look Using Adversarial Networks

Mengmi Zhang; Keng Teck Ma; Joo Hwee Lim; Qi Zhao; Jiashi Feng

doi:10.1109/TPAMI.2018.2871688

Computer Science and Engineering

Research output: Contribution to journal › Article › peer-review

12 Scopus citations

Abstract

We introduce a new problem, gaze anticipation on future frames, which extends the conventional gaze prediction problem beyond current frames. To solve this problem, we propose a new generative adversarial network-based model, Deep Future Gaze (DFG), encompassing two pathways: DFG-P anticipates gaze prior maps conditioned on the input frame which provides task influences; DFG-G learns to model both semantic and motion information in future frame generation. DFG-P and DFG-G are then fused to anticipate future gazes. DFG-G consists of two networks: a generator and a discriminator. The generator uses a two-stream spatial-temporal convolution architecture (3D-CNN) that explicitly untangles the foreground and background to generate future frames. It then attaches another 3D-CNN for gaze anticipation based on these synthetic frames. The discriminator plays against the generator by distinguishing the generator's synthetic frames from real frames. Experimental results on publicly available egocentric and third-person video datasets show that DFG significantly outperforms all competitive baselines. We also demonstrate that DFG achieves better gaze prediction performance on current frames in egocentric and third-person videos than state-of-the-art methods.

Original language	English (US)
Article number	8471119
Pages (from-to)	1783-1796
Number of pages	14
Journal	IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume	41
Issue number	8
DOIs	https://doi.org/10.1109/TPAMI.2018.2871688
State	Published - Aug 1 2019

Bibliographical note

Funding Information:
This work was supported by the Reverse Engineering Visual Intelligence for cognitiVe Enhancement (REVIVE) programme (1335H00098) funded by A*STAR, National University of Singapore startup grant R-263-000-C08-133, and Ministry of Education of Singapore AcRF Tier One grant R-263-000-C21-112. We would also like to thank Yin Li, Sayed Hossein Khatoonabadi, and Victor Leboran for their help in replicating the experimental setups in [3], [33], [34].

Publisher Copyright:
© 2018 IEEE.

Keywords

Egocentric videos
gaze anticipation
generative adversarial network
saliency
visual attention

Cite this

@article{c3ca8c384c2a4074a9e3c27a7c18d14c,

title = "Anticipating Where People will Look Using Adversarial Networks",

abstract = "We introduce a new problem of gaze anticipation on future frames which extends the conventional gaze prediction problem to go beyond current frames. To solve this problem, we propose a new generative adversarial network based model, Deep Future Gaze (DFG), encompassing two pathways: DFG-P is to anticipate gaze prior maps conditioned on the input frame which provides task influences; DFG-G is to learn to model both semantic and motion information in future frame generation. DFG-P and DFG-G are then fused to anticipate future gazes. DFG-G consists of two networks: a generator and a discriminator. The generator uses a two-stream spatial-temporal convolution architecture (3D-CNN) for explicitly untangling the foreground and background to generate future frames. It then attaches another 3D-CNN for gaze anticipation based on these synthetic frames. The discriminator plays against the generator by distinguishing the synthetic frames of the generator from the real frames. Experimental results on the publicly available egocentric and third person video datasets show that DFG significantly outperforms all competitive baselines. We also demonstrate that DFG achieves better performance of gaze prediction on current frames in egocentric and third person videos than state-of-the-art methods.",

keywords = "Egocentric videos, gaze anticipation, generative adversarial network, saliency, visual attention",

author = "Zhang, Mengmi and Ma, {Keng Teck} and Lim, {Joo Hwee} and Zhao, Qi and Feng, Jiashi",

note = "Funding Information: This work was supported by the Reverse Engineering Visual Intelligence for cognitiVe Enhancement (REVIVE) programme (1335H00098) funded by A*STAR, National University of Singapore startup grant R-263-000-C08-133, and Ministry of Education of Singapore AcRF Tier One grant R-263-000-C21-112. We would also like to thank Yin Li, Sayed Hossein Khatoonabadi, and Victor Leboran for their help in replicating the experimental setups in [3], [33], [34]. Publisher Copyright: {\textcopyright} 2018 IEEE.",

year = "2019",

month = aug,

day = "1",

doi = "10.1109/TPAMI.2018.2871688",

language = "English (US)",

volume = "41",

pages = "1783--1796",

journal = "IEEE Transactions on Pattern Analysis and Machine Intelligence",

issn = "0162-8828",

publisher = "IEEE Computer Society",

number = "8",

}

TY - JOUR

T1 - Anticipating Where People will Look Using Adversarial Networks

AU - Zhang, Mengmi

AU - Ma, Keng Teck

AU - Lim, Joo Hwee

AU - Zhao, Qi

AU - Feng, Jiashi

N1 - Funding Information: This work was supported by the Reverse Engineering Visual Intelligence for cognitiVe Enhancement (REVIVE) programme (1335H00098) funded by A*STAR, National University of Singapore startup grant R-263-000-C08-133, and Ministry of Education of Singapore AcRF Tier One grant R-263-000-C21-112. We would also like to thank Yin Li, Sayed Hossein Khatoonabadi, and Victor Leboran for their help in replicating the experimental setups in [3], [33], [34]. Publisher Copyright: © 2018 IEEE.

PY - 2019/8/1

Y1 - 2019/8/1

N2 - We introduce a new problem of gaze anticipation on future frames which extends the conventional gaze prediction problem to go beyond current frames. To solve this problem, we propose a new generative adversarial network based model, Deep Future Gaze (DFG), encompassing two pathways: DFG-P is to anticipate gaze prior maps conditioned on the input frame which provides task influences; DFG-G is to learn to model both semantic and motion information in future frame generation. DFG-P and DFG-G are then fused to anticipate future gazes. DFG-G consists of two networks: a generator and a discriminator. The generator uses a two-stream spatial-temporal convolution architecture (3D-CNN) for explicitly untangling the foreground and background to generate future frames. It then attaches another 3D-CNN for gaze anticipation based on these synthetic frames. The discriminator plays against the generator by distinguishing the synthetic frames of the generator from the real frames. Experimental results on the publicly available egocentric and third person video datasets show that DFG significantly outperforms all competitive baselines. We also demonstrate that DFG achieves better performance of gaze prediction on current frames in egocentric and third person videos than state-of-the-art methods.

AB - We introduce a new problem of gaze anticipation on future frames which extends the conventional gaze prediction problem to go beyond current frames. To solve this problem, we propose a new generative adversarial network based model, Deep Future Gaze (DFG), encompassing two pathways: DFG-P is to anticipate gaze prior maps conditioned on the input frame which provides task influences; DFG-G is to learn to model both semantic and motion information in future frame generation. DFG-P and DFG-G are then fused to anticipate future gazes. DFG-G consists of two networks: a generator and a discriminator. The generator uses a two-stream spatial-temporal convolution architecture (3D-CNN) for explicitly untangling the foreground and background to generate future frames. It then attaches another 3D-CNN for gaze anticipation based on these synthetic frames. The discriminator plays against the generator by distinguishing the synthetic frames of the generator from the real frames. Experimental results on the publicly available egocentric and third person video datasets show that DFG significantly outperforms all competitive baselines. We also demonstrate that DFG achieves better performance of gaze prediction on current frames in egocentric and third person videos than state-of-the-art methods.

KW - Egocentric videos

KW - gaze anticipation

KW - generative adversarial network

KW - saliency

KW - visual attention

UR - http://www.scopus.com/inward/record.url?scp=85054626845&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85054626845&partnerID=8YFLogxK

U2 - 10.1109/TPAMI.2018.2871688

DO - 10.1109/TPAMI.2018.2871688

M3 - Article

C2 - 30273143

AN - SCOPUS:85054626845

SN - 0162-8828

VL - 41

SP - 1783

EP - 1796

JO - IEEE Transactions on Pattern Analysis and Machine Intelligence

JF - IEEE Transactions on Pattern Analysis and Machine Intelligence

IS - 8

M1 - 8471119

ER -
