Unsupervised Learning of Important Objects from First-Person Videos

Gedas Bertasius; Hyun Soo Park; Stella X. Yu; Jianbo Shi

doi:10.1109/ICCV.2017.216

Computer Science and Engineering

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

18 Scopus citations

Abstract

A first-person camera, placed on a person's head, captures which objects are important to the camera wearer. Most prior methods for this task learn to detect such important objects from manually labeled first-person data in a supervised fashion. However, important objects are strongly related to the camera wearer's internal state, such as intentions and attention, and thus only the person wearing the camera can provide the importance labels. Such a constraint makes the annotation process costly and limits its scalability. In this work, we show that we can detect important objects in first-person images without supervision from the camera wearer or even from third-person labelers. We formulate the important object detection problem as an interplay between 1) a segmentation agent and 2) a recognition agent. The segmentation agent first proposes a possible important object segmentation mask for each image and then feeds it to the recognition agent, which learns to predict an important object mask using visual semantics and spatial features. We implement this interplay between the two agents via an alternating cross-pathway supervision scheme inside our proposed Visual-Spatial Network (VSN). Our VSN consists of spatial ('where') and visual ('what') pathways: the visual pathway learns common visual semantics, while the spatial pathway focuses on spatial location cues. Our unsupervised learning is accomplished via cross-pathway supervision, where one pathway feeds its predictions to a segmentation agent, which proposes a candidate important object segmentation mask that is then used by the other pathway as a supervisory signal. We show our method's success on two different important object datasets, where it achieves results similar to or better than those of supervised methods.
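The alternating cross-pathway supervision described in the abstract can be illustrated with a small toy sketch. This is not the authors' VSN: each pathway is replaced here by a nearest-centroid pixel classifier, the segmentation agent by a simple above-the-mean threshold, and the starting signal by a center prior. All of these, along with every name in the code (`Pathway`, `segmentation_agent`, `train_cross_pathway`, `make_toy_image`), are assumptions made for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def centroid(rows):
    """Mean feature vector of a non-empty list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class Pathway:
    """Toy stand-in for one pathway: a nearest-centroid pixel classifier."""

    def fit(self, feats, mask):
        # Assumes the mask contains both foreground and background pixels.
        self.fg = centroid([f for f, m in zip(feats, mask) if m])
        self.bg = centroid([f for f, m in zip(feats, mask) if not m])

    def predict(self, feats):
        # Probability rises as a pixel gets closer to the foreground centroid.
        return [sigmoid(sqdist(f, self.bg) - sqdist(f, self.fg)) for f in feats]


def segmentation_agent(probs):
    """Hypothetical agent: propose a mask of pixels scoring above the mean."""
    mean = sum(probs) / len(probs)
    return [p > mean for p in probs]


def train_cross_pathway(visual_feats, spatial_feats, prior, rounds=3):
    visual, spatial = Pathway(), Pathway()
    probs = prior  # assumed starting point: a center prior, no labels
    for _ in range(rounds):
        # Spatial-side scores -> proposed mask -> supervises the visual pathway.
        visual.fit(visual_feats, segmentation_agent(probs))
        # Visual scores -> proposed mask -> supervises the spatial pathway.
        proposal = segmentation_agent(visual.predict(visual_feats))
        spatial.fit(spatial_feats, proposal)
        probs = spatial.predict(spatial_feats)
    return visual, spatial


def make_toy_image(size=8):
    """One synthetic image: a bright 4x4 object centered on a dark background."""
    c = (size - 1) / 2
    max_d2 = 2 * c * c
    visual_feats, spatial_feats, true_mask = [], [], []
    for y in range(size):
        for x in range(size):
            inside = abs(x - c) < 2 and abs(y - c) < 2
            d2 = ((x - c) ** 2 + (y - c) ** 2) / max_d2
            visual_feats.append([1.0 if inside else 0.0])  # brightness
            spatial_feats.append([d2])  # normalized squared distance to center
            true_mask.append(inside)
    return visual_feats, spatial_feats, true_mask


visual_feats, spatial_feats, true_mask = make_toy_image()
prior = [1.0 - s[0] for s in spatial_feats]
visual, spatial = train_cross_pathway(visual_feats, spatial_feats, prior)
recovered = [p > 0.5 for p in visual.predict(visual_feats)]
```

In this toy setting the true mask is never shown to either pathway; it is built only so the result can be checked afterwards. The center prior initially proposes a region larger than the object, and the visual pathway narrows it down because the object pixels share a common appearance.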

Original language	English (US)
Title of host publication	Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	1974-1982
Number of pages	9
ISBN (Electronic)	9781538610329
DOIs	https://doi.org/10.1109/ICCV.2017.216
State	Published - Dec 22 2017
Event	16th IEEE International Conference on Computer Vision, ICCV 2017 - Venice, Italy
Duration	Oct 22 2017 → Oct 29 2017

Publication series

Name	Proceedings of the IEEE International Conference on Computer Vision
Volume	2017-October
ISSN (Print)	1550-5499

Other

Other	16th IEEE International Conference on Computer Vision, ICCV 2017
Country/Territory	Italy
City	Venice
Period	10/22/17 → 10/29/17

Bibliographical note

Publisher Copyright:
© 2017 IEEE.

Cite this

Bertasius, G., Park, H. S., Yu, S. X., & Shi, J. (2017). Unsupervised Learning of Important Objects from First-Person Videos. In Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017 (pp. 1974-1982). Article 8237478 (Proceedings of the IEEE International Conference on Computer Vision; Vol. 2017-October). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICCV.2017.216

Unsupervised Learning of Important Objects from First-Person Videos. / Bertasius, Gedas; Park, Hyun Soo; Yu, Stella X. et al.
Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017. Institute of Electrical and Electronics Engineers Inc., 2017. p. 1974-1982 8237478 (Proceedings of the IEEE International Conference on Computer Vision; Vol. 2017-October).

Bertasius, G, Park, HS, Yu, SX & Shi, J 2017, Unsupervised Learning of Important Objects from First-Person Videos. in Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017, 8237478, Proceedings of the IEEE International Conference on Computer Vision, vol. 2017-October, Institute of Electrical and Electronics Engineers Inc., pp. 1974-1982, 16th IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, 10/22/17. https://doi.org/10.1109/ICCV.2017.216

@inproceedings{543d231b6ed442c9b0fa3a6fd10cd579,

title = "Unsupervised Learning of Important Objects from First-Person Videos",

author = "Gedas Bertasius and Park, {Hyun Soo} and Yu, {Stella X.} and Jianbo Shi",

note = "Publisher Copyright: {\textcopyright} 2017 IEEE.; 16th IEEE International Conference on Computer Vision, ICCV 2017 ; Conference date: 22-10-2017 Through 29-10-2017",

year = "2017",

month = dec,

day = "22",

doi = "10.1109/ICCV.2017.216",

language = "English (US)",

series = "Proceedings of the IEEE International Conference on Computer Vision",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "1974--1982",

booktitle = "Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017",

}

TY - GEN

T1 - Unsupervised Learning of Important Objects from First-Person Videos

AU - Bertasius, Gedas

AU - Park, Hyun Soo

AU - Yu, Stella X.

AU - Shi, Jianbo

PY - 2017/12/22

Y1 - 2017/12/22

UR - http://www.scopus.com/inward/record.url?scp=85041905004&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85041905004&partnerID=8YFLogxK

U2 - 10.1109/ICCV.2017.216

DO - 10.1109/ICCV.2017.216

M3 - Conference contribution

AN - SCOPUS:85041905004

T3 - Proceedings of the IEEE International Conference on Computer Vision

SP - 1974

EP - 1982

BT - Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 16th IEEE International Conference on Computer Vision, ICCV 2017

Y2 - 22 October 2017 through 29 October 2017

ER -
