Attention transfer from web images for video recognition

Junnan Li; Yongkang Wong; Qi Zhao; Mohan S. Kankanhalli

doi:10.1145/3123266.3123432

Computer Science and Engineering

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

39 Scopus citations

Abstract

Training deep learning based video classifiers for action recognition requires a large number of labeled videos. The labeling process is labor-intensive and time-consuming. On the other hand, a large number of weakly labeled images are uploaded to the Internet by users every day. To harness the rich and highly diverse set of Web images, a scalable approach is to crawl these images to train deep learning based classifiers, such as Convolutional Neural Networks (CNNs). However, due to the domain shift problem, the performance of deep classifiers trained on Web images tends to degrade when they are directly deployed on videos. One way to address this problem is to fine-tune the trained models on videos, but a sufficient number of annotated videos is still required. In this work, we propose a novel approach to transfer knowledge from the image domain to the video domain. The proposed method can adapt to the target domain (i.e., video data) with a limited amount of training data. Our method maps the video frames into a low-dimensional feature space using the class-discriminative spatial attention map for CNNs. We design a novel Siamese EnergyNet structure to learn energy functions on the attention maps by jointly optimizing two loss functions, such that the attention map corresponding to a ground truth concept has higher energy. We conduct extensive experiments on two challenging video recognition datasets (i.e., TVHI and UCF101) and demonstrate the efficacy of our proposed method.

Original language	English (US)
Title of host publication	MM 2017 - Proceedings of the 2017 ACM Multimedia Conference
Publisher	Association for Computing Machinery, Inc
Pages	1-9
Number of pages	9
ISBN (Electronic)	9781450349062
DOIs	https://doi.org/10.1145/3123266.3123432
State	Published - Oct 23 2017
Event	25th ACM International Conference on Multimedia, MM 2017 - Mountain View, United States Duration: Oct 23 2017 → Oct 27 2017

Publication series

Name	MM 2017 - Proceedings of the 2017 ACM Multimedia Conference

Other

Other	25th ACM International Conference on Multimedia, MM 2017
Country/Territory	United States
City	Mountain View
Period	10/23/17 → 10/27/17

Bibliographical note

Funding Information:
This research is supported by the National Research Foundation, Prime Minister’s Office, Singapore under its International Research Centre in Singapore Funding Initiative.

Publisher Copyright:
© 2017 Association for Computing Machinery.

Keywords

Action recognition
Attention map
Domain adaptation

Cite this

Attention transfer from web images for video recognition. / Li, Junnan; Wong, Yongkang; Zhao, Qi et al.
MM 2017 - Proceedings of the 2017 ACM Multimedia Conference. Association for Computing Machinery, Inc, 2017. p. 1-9 (MM 2017 - Proceedings of the 2017 ACM Multimedia Conference).

Li, J, Wong, Y, Zhao, Q & Kankanhalli, MS 2017, Attention transfer from web images for video recognition. in MM 2017 - Proceedings of the 2017 ACM Multimedia Conference. MM 2017 - Proceedings of the 2017 ACM Multimedia Conference, Association for Computing Machinery, Inc, pp. 1-9, 25th ACM International Conference on Multimedia, MM 2017, Mountain View, United States, 10/23/17. https://doi.org/10.1145/3123266.3123432

@inproceedings{a06aa26c1c32401da48317d6952cb052,

title = "Attention transfer from web images for video recognition",

abstract = "Training deep learning based video classifiers for action recognition requires a large number of labeled videos. The labeling process is labor-intensive and time-consuming. On the other hand, a large number of weakly labeled images are uploaded to the Internet by users every day. To harness the rich and highly diverse set of Web images, a scalable approach is to crawl these images to train deep learning based classifiers, such as Convolutional Neural Networks (CNNs). However, due to the domain shift problem, the performance of deep classifiers trained on Web images tends to degrade when they are directly deployed on videos. One way to address this problem is to fine-tune the trained models on videos, but a sufficient number of annotated videos is still required. In this work, we propose a novel approach to transfer knowledge from the image domain to the video domain. The proposed method can adapt to the target domain (i.e., video data) with a limited amount of training data. Our method maps the video frames into a low-dimensional feature space using the class-discriminative spatial attention map for CNNs. We design a novel Siamese EnergyNet structure to learn energy functions on the attention maps by jointly optimizing two loss functions, such that the attention map corresponding to a ground truth concept has higher energy. We conduct extensive experiments on two challenging video recognition datasets (i.e., TVHI and UCF101) and demonstrate the efficacy of our proposed method.",

keywords = "Action recognition, Attention map, Domain adaptation",

author = "Junnan Li and Yongkang Wong and Qi Zhao and Kankanhalli, {Mohan S.}",

note = "Funding Information: This research is supported by the National Research Foundation, Prime Minister{\textquoteright}s Office, Singapore under its International Research Centre in Singapore Funding Initiative. Publisher Copyright: {\textcopyright} 2017 Association for Computing Machinery.; 25th ACM International Conference on Multimedia, MM 2017 ; Conference date: 23-10-2017 Through 27-10-2017",

year = "2017",

month = oct,

day = "23",

doi = "10.1145/3123266.3123432",

language = "English (US)",

series = "MM 2017 - Proceedings of the 2017 ACM Multimedia Conference",

publisher = "Association for Computing Machinery, Inc",

pages = "1--9",

booktitle = "MM 2017 - Proceedings of the 2017 ACM Multimedia Conference",

}

TY - GEN

T1 - Attention transfer from web images for video recognition

AU - Li, Junnan

AU - Wong, Yongkang

AU - Zhao, Qi

AU - Kankanhalli, Mohan S.

N1 - Funding Information: This research is supported by the National Research Foundation, Prime Minister’s Office, Singapore under its International Research Centre in Singapore Funding Initiative. Publisher Copyright: © 2017 Association for Computing Machinery.

PY - 2017/10/23

Y1 - 2017/10/23

AB - Training deep learning based video classifiers for action recognition requires a large number of labeled videos. The labeling process is labor-intensive and time-consuming. On the other hand, a large number of weakly labeled images are uploaded to the Internet by users every day. To harness the rich and highly diverse set of Web images, a scalable approach is to crawl these images to train deep learning based classifiers, such as Convolutional Neural Networks (CNNs). However, due to the domain shift problem, the performance of deep classifiers trained on Web images tends to degrade when they are directly deployed on videos. One way to address this problem is to fine-tune the trained models on videos, but a sufficient number of annotated videos is still required. In this work, we propose a novel approach to transfer knowledge from the image domain to the video domain. The proposed method can adapt to the target domain (i.e., video data) with a limited amount of training data. Our method maps the video frames into a low-dimensional feature space using the class-discriminative spatial attention map for CNNs. We design a novel Siamese EnergyNet structure to learn energy functions on the attention maps by jointly optimizing two loss functions, such that the attention map corresponding to a ground truth concept has higher energy. We conduct extensive experiments on two challenging video recognition datasets (i.e., TVHI and UCF101) and demonstrate the efficacy of our proposed method.

KW - Action recognition

KW - Attention map

KW - Domain adaptation

UR - http://www.scopus.com/inward/record.url?scp=85035193802&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85035193802&partnerID=8YFLogxK

U2 - 10.1145/3123266.3123432

DO - 10.1145/3123266.3123432

M3 - Conference contribution

AN - SCOPUS:85035193802

T3 - MM 2017 - Proceedings of the 2017 ACM Multimedia Conference

SP - 1

EP - 9

BT - MM 2017 - Proceedings of the 2017 ACM Multimedia Conference

PB - Association for Computing Machinery, Inc

T2 - 25th ACM International Conference on Multimedia, MM 2017

Y2 - 23 October 2017 through 27 October 2017

ER -
