Video Storytelling: Textual Summaries for Events

Junnan Li; Yongkang Wong; Qi Zhao; Mohan S. Kankanhalli

doi:10.1109/TMM.2019.2930041

Computer Science and Engineering

Research output: Contribution to journal › Article › peer-review

28 Scopus citations

Abstract

Bridging vision and natural language is a longstanding goal in computer vision and multimedia research. While earlier works focus on generating a single-sentence description for visual content, recent works have studied paragraph generation. In this paper, we introduce the problem of video storytelling, which aims at generating coherent and succinct stories for long videos. Video storytelling introduces new challenges, mainly due to the diversity of the story and the length and complexity of the video. We propose novel methods to address the challenges. First, we propose a context-aware framework for multimodal embedding learning, where we design a residual bidirectional recurrent neural network to leverage contextual information from past and future. The multimodal embedding is then used to retrieve sentences for video clips. Second, we propose a Narrator model to select clips that are representative of the underlying storyline. The Narrator is formulated as a reinforcement learning agent, which is trained by directly optimizing the textual metric of the generated story. We evaluate our method on the video story dataset, a new dataset that we have collected to enable the study. We compare our method with multiple state-of-the-art baselines and show that our method achieves better performance, in terms of quantitative measures and user study.

Original language: English (US)
Article number: 8768045
Pages (from-to): 554-565
Number of pages: 12
Journal: IEEE Transactions on Multimedia
Volume: 22
Issue number: 2
DOI: https://doi.org/10.1109/TMM.2019.2930041
State: Published - Feb 2020

Bibliographical note

Publisher Copyright:
© 1999-2012 IEEE.

Keywords

Video storytelling
multimodal embedding learning
sentence retrieval
video captioning

Cite this

@article{56527b908624421f8c61ac5efa17f9fd,
  title = "Video Storytelling: Textual Summaries for Events",
  abstract = "Bridging vision and natural language is a longstanding goal in computer vision and multimedia research. While earlier works focus on generating a single-sentence description for visual content, recent works have studied paragraph generation. In this paper, we introduce the problem of video storytelling, which aims at generating coherent and succinct stories for long videos. Video storytelling introduces new challenges, mainly due to the diversity of the story and the length and complexity of the video. We propose novel methods to address the challenges. First, we propose a context-aware framework for multimodal embedding learning, where we design a residual bidirectional recurrent neural network to leverage contextual information from past and future. The multimodal embedding is then used to retrieve sentences for video clips. Second, we propose a Narrator model to select clips that are representative of the underlying storyline. The Narrator is formulated as a reinforcement learning agent, which is trained by directly optimizing the textual metric of the generated story. We evaluate our method on the video story dataset, a new dataset that we have collected to enable the study. We compare our method with multiple state-of-the-art baselines and show that our method achieves better performance, in terms of quantitative measures and user study.",
  keywords = "Video storytelling, multimodal embedding learning, sentence retrieval, video captioning",
  author = "Junnan Li and Yongkang Wong and Qi Zhao and Kankanhalli, {Mohan S.}",
  note = "Publisher Copyright: {\textcopyright} 1999-2012 IEEE.",
  year = "2020",
  month = feb,
  doi = "10.1109/TMM.2019.2930041",
  language = "English (US)",
  volume = "22",
  pages = "554--565",
  journal = "IEEE Transactions on Multimedia",
  issn = "1520-9210",
  publisher = "Institute of Electrical and Electronics Engineers Inc.",
  number = "2",
}

TY  - JOUR
T1  - Video Storytelling
T2  - Textual Summaries for Events
AU  - Li, Junnan
AU  - Wong, Yongkang
AU  - Zhao, Qi
AU  - Kankanhalli, Mohan S.
PY  - 2020/2
Y1  - 2020/2
N2  - Bridging vision and natural language is a longstanding goal in computer vision and multimedia research. While earlier works focus on generating a single-sentence description for visual content, recent works have studied paragraph generation. In this paper, we introduce the problem of video storytelling, which aims at generating coherent and succinct stories for long videos. Video storytelling introduces new challenges, mainly due to the diversity of the story and the length and complexity of the video. We propose novel methods to address the challenges. First, we propose a context-aware framework for multimodal embedding learning, where we design a residual bidirectional recurrent neural network to leverage contextual information from past and future. The multimodal embedding is then used to retrieve sentences for video clips. Second, we propose a Narrator model to select clips that are representative of the underlying storyline. The Narrator is formulated as a reinforcement learning agent, which is trained by directly optimizing the textual metric of the generated story. We evaluate our method on the video story dataset, a new dataset that we have collected to enable the study. We compare our method with multiple state-of-the-art baselines and show that our method achieves better performance, in terms of quantitative measures and user study.
AB  - Bridging vision and natural language is a longstanding goal in computer vision and multimedia research. While earlier works focus on generating a single-sentence description for visual content, recent works have studied paragraph generation. In this paper, we introduce the problem of video storytelling, which aims at generating coherent and succinct stories for long videos. Video storytelling introduces new challenges, mainly due to the diversity of the story and the length and complexity of the video. We propose novel methods to address the challenges. First, we propose a context-aware framework for multimodal embedding learning, where we design a residual bidirectional recurrent neural network to leverage contextual information from past and future. The multimodal embedding is then used to retrieve sentences for video clips. Second, we propose a Narrator model to select clips that are representative of the underlying storyline. The Narrator is formulated as a reinforcement learning agent, which is trained by directly optimizing the textual metric of the generated story. We evaluate our method on the video story dataset, a new dataset that we have collected to enable the study. We compare our method with multiple state-of-the-art baselines and show that our method achieves better performance, in terms of quantitative measures and user study.
KW  - Video storytelling
KW  - multimodal embedding learning
KW  - sentence retrieval
KW  - video captioning
UR  - http://www.scopus.com/inward/record.url?scp=85079588054&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85079588054&partnerID=8YFLogxK
U2  - 10.1109/TMM.2019.2930041
DO  - 10.1109/TMM.2019.2930041
M3  - Article
AN  - SCOPUS:85079588054
SN  - 1520-9210
VL  - 22
SP  - 554
EP  - 565
JO  - IEEE Transactions on Multimedia
JF  - IEEE Transactions on Multimedia
IS  - 2
M1  - 8768045
ER  -