Action recognition using global spatio-temporal features derived from sparse representations

Guruprasad Somasundaram; Anoop Cherian; Vassilios Morellas; Nikolaos Papanikolopoulos

doi:10.1016/j.cviu.2014.01.002

Computer Science and Engineering

Research output: Contribution to journal › Article › peer-review

47 Scopus citations

Abstract

Recognizing actions is one of the important challenges in computer vision with respect to video data, with applications to surveillance, diagnostics of mental disorders, and video retrieval. Compared to other data modalities such as documents and images, processing video data demands orders of magnitude higher computational and storage resources. One way to alleviate this difficulty is to focus the computations on informative (salient) regions of the video. In this paper, we propose a novel global spatio-temporal self-similarity measure to score saliency using the ideas of dictionary learning and sparse coding. In contrast to existing methods that use local spatio-temporal feature detectors along with descriptors (such as HOG, HOG3D, and HOF), dictionary learning helps consider the saliency in a global setting (on the entire video) in a computationally efficient way. We consider only a small percentage of the most salient (least self-similar) regions found using our algorithm, over which spatio-temporal descriptors such as HOG and region covariance descriptors are computed. The ensemble of such block descriptors in a bag-of-features framework provides a holistic description of the motion sequence, which can be used in a classification setting. Experiments on several benchmark datasets in video-based action classification demonstrate that our approach performs competitively with the state of the art.
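The abstract does not give the algorithm's details, so the following is only a rough sketch of the general idea it describes: score each video block by how poorly a sparse code over a dictionary reconstructs it, then keep the small fraction of blocks that are least self-similar. Everything here is an assumption made for illustration and not the authors' method: the dictionary is taken as given (with unit-norm atoms) rather than learned, plain matching pursuit is swapped in as the sparse coder, and the names `matching_pursuit_residual`, `select_salient`, `n_nonzero`, and `keep_fraction` are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit_residual(x, atoms, n_nonzero):
    """Greedily code x with at most n_nonzero unit-norm atoms.

    Returns the norm of what is left unexplained. A large value means
    the block is poorly represented by the dictionary (assumed here to
    indicate low self-similarity, i.e. high saliency).
    """
    residual = list(x)
    for _ in range(n_nonzero):
        # Correlate the current residual with every atom.
        coeffs = [dot(residual, d) for d in atoms]
        best = max(range(len(atoms)), key=lambda j: abs(coeffs[j]))
        if abs(coeffs[best]) < 1e-12:
            break  # no atom explains anything further
        # Subtract the projection onto the best-matching atom.
        residual = [r - coeffs[best] * a for r, a in zip(residual, atoms[best])]
    return math.sqrt(dot(residual, residual))


def select_salient(blocks, atoms, n_nonzero=1, keep_fraction=0.25):
    """Return (indices of the most salient blocks, all saliency scores)."""
    scores = [matching_pursuit_residual(b, atoms, n_nonzero) for b in blocks]
    n_keep = max(1, int(round(keep_fraction * len(blocks))))
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep]), scores


# Toy data: flattened "blocks" as 3-vectors and a two-atom dictionary.
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
blocks = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 4.0]]
selected, scores = select_salient(blocks, atoms)
print(selected, scores)
```

In this toy setup the last block lies entirely outside the span of the dictionary, so it receives the largest residual and is the one selected. In a pipeline like the one the abstract outlines, descriptors (for example HOG or region covariance) would then be computed only on the selected blocks and pooled in a bag-of-features representation.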

Original language	English (US)
Pages (from-to)	1-13
Number of pages	13
Journal	Computer Vision and Image Understanding
Volume	123
DOIs	https://doi.org/10.1016/j.cviu.2014.01.002
State	Published - Jun 2014

Bibliographical note

Funding Information:
This material is based upon work supported in part by the Minnesota Department of Transportation and by the National Science Foundation through grants #IIP-0443945, #CNS-0821474, #IIP-0934327, #CNS-1039741, #SMA-1028076, and #CNS-1338042.

Keywords

Action classification
Activity recognition
Global spatio-temporal features

Cite this

@article{3d481a69728a4985925b6f315adf1907,

title = "Action recognition using global spatio-temporal features derived from sparse representations",

abstract = "Recognizing actions is one of the important challenges in computer vision with respect to video data, with applications to surveillance, diagnostics of mental disorders, and video retrieval. Compared to other data modalities such as documents and images, processing video data demands orders of magnitude higher computational and storage resources. One way to alleviate this difficulty is to focus the computations to informative (salient) regions of the video. In this paper, we propose a novel global spatio-temporal self-similarity measure to score saliency using the ideas of dictionary learning and sparse coding. In contrast to existing methods that use local spatio-temporal feature detectors along with descriptors (such as HOG, HOG3D, and HOF), dictionary learning helps consider the saliency in a global setting (on the entire video) in a computationally efficient way. We consider only a small percentage of the most salient (least self-similar) regions found using our algorithm, over which spatio-temporal descriptors such as HOG and region covariance descriptors are computed. The ensemble of such block descriptors in a bag-of-features framework provides a holistic description of the motion sequence which can be used in a classification setting. Experiments on several benchmark datasets in video based action classification demonstrate that our approach performs competitively to the state of the art.",

keywords = "Action classification, Activity recognition, Global spatio-temporal features",

author = "Guruprasad Somasundaram and Anoop Cherian and Vassilios Morellas and Nikolaos Papanikolopoulos",

note = "Funding Information: This material is based upon work supported in part by the Minnesota Department of Transportation and by the National Science Foundation through grants \#IIP-0443945, \#CNS-0821474, \#IIP-0934327, \#CNS-1039741, \#SMA-1028076, and \#CNS-1338042.",

year = "2014",

month = jun,

doi = "10.1016/j.cviu.2014.01.002",

language = "English (US)",

volume = "123",

pages = "1--13",

journal = "Computer Vision and Image Understanding",

issn = "1077-3142",

publisher = "Academic Press Inc.",

}

TY - JOUR

T1 - Action recognition using global spatio-temporal features derived from sparse representations

AU - Somasundaram, Guruprasad

AU - Cherian, Anoop

AU - Morellas, Vassilios

AU - Papanikolopoulos, Nikolaos

N1 - Funding Information: This material is based upon work supported in part by the Minnesota Department of Transportation and by the National Science Foundation through grants #IIP-0443945, #CNS-0821474, #IIP-0934327, #CNS-1039741, #SMA-1028076, and #CNS-1338042.

PY - 2014/6

Y1 - 2014/6

N2 - Recognizing actions is one of the important challenges in computer vision with respect to video data, with applications to surveillance, diagnostics of mental disorders, and video retrieval. Compared to other data modalities such as documents and images, processing video data demands orders of magnitude higher computational and storage resources. One way to alleviate this difficulty is to focus the computations to informative (salient) regions of the video. In this paper, we propose a novel global spatio-temporal self-similarity measure to score saliency using the ideas of dictionary learning and sparse coding. In contrast to existing methods that use local spatio-temporal feature detectors along with descriptors (such as HOG, HOG3D, and HOF), dictionary learning helps consider the saliency in a global setting (on the entire video) in a computationally efficient way. We consider only a small percentage of the most salient (least self-similar) regions found using our algorithm, over which spatio-temporal descriptors such as HOG and region covariance descriptors are computed. The ensemble of such block descriptors in a bag-of-features framework provides a holistic description of the motion sequence which can be used in a classification setting. Experiments on several benchmark datasets in video based action classification demonstrate that our approach performs competitively to the state of the art.

AB - Recognizing actions is one of the important challenges in computer vision with respect to video data, with applications to surveillance, diagnostics of mental disorders, and video retrieval. Compared to other data modalities such as documents and images, processing video data demands orders of magnitude higher computational and storage resources. One way to alleviate this difficulty is to focus the computations to informative (salient) regions of the video. In this paper, we propose a novel global spatio-temporal self-similarity measure to score saliency using the ideas of dictionary learning and sparse coding. In contrast to existing methods that use local spatio-temporal feature detectors along with descriptors (such as HOG, HOG3D, and HOF), dictionary learning helps consider the saliency in a global setting (on the entire video) in a computationally efficient way. We consider only a small percentage of the most salient (least self-similar) regions found using our algorithm, over which spatio-temporal descriptors such as HOG and region covariance descriptors are computed. The ensemble of such block descriptors in a bag-of-features framework provides a holistic description of the motion sequence which can be used in a classification setting. Experiments on several benchmark datasets in video based action classification demonstrate that our approach performs competitively to the state of the art.

KW - Action classification

KW - Activity recognition

KW - Global spatio-temporal features

UR - http://www.scopus.com/inward/record.url?scp=84899639091&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84899639091&partnerID=8YFLogxK

U2 - 10.1016/j.cviu.2014.01.002

DO - 10.1016/j.cviu.2014.01.002

M3 - Article

AN - SCOPUS:84899639091

SN - 1077-3142

VL - 123

SP - 1

EP - 13

JO - Computer Vision and Image Understanding

JF - Computer Vision and Image Understanding

ER -
