Slicing convolutional neural network for crowd video understanding

Jing Shao; Chen Change Loy; Kai Kang; Xiaogang Wang

doi:10.1109/CVPR.2016.606

Chinese University of Hong Kong

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

80 Citations (Scopus)

Abstract

Learning and capturing both appearance and dynamic representations are pivotal for crowd video understanding. Convolutional Neural Networks (CNNs) have shown remarkable potential in learning appearance representations from images. However, the learning of dynamic representations, and how they can be effectively combined with appearance features for video analysis, remains an open problem. In this study, we propose a novel spatio-temporal CNN, named Slicing CNN (S-CNN), based on the decomposition of 3D feature maps into 2D spatial- and 2D temporal-slice representations. The decomposition brings unique advantages: (1) the model is capable of capturing dynamics of different semantic units such as groups and objects, (2) it learns separated appearance and dynamic representations while keeping proper interactions between them, and (3) it exploits the selectiveness of spatial filters to discard irrelevant background clutter for crowd understanding. We demonstrate the effectiveness of the proposed S-CNN model on the WWW crowd video dataset for attribute recognition and observe significant performance improvements over the state-of-the-art methods (62.55% versus 51.84% [21]).
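
The decomposition of a 3D feature map into 2D slices mentioned in the abstract can be illustrated with a minimal sketch. It assumes a single-channel feature cuboid stored as nested lists indexed `[t][y][x]`. The function name `slice_cuboid` and the `xy`/`xt`/`yt` slice names are assumptions made for illustration and are not taken from this record. The sketch covers only the slicing step, not the S-CNN model itself.

```python
def slice_cuboid(cuboid):
    """Split a feature cuboid indexed [t][y][x] into three sets of 2D slices.

    Hypothetical helper for illustration only.
    Returns (xy, xt, yt):
      xy[t] is an H x W spatial slice at time step t,
      xt[y] is a T x W temporal slice at image row y,
      yt[x] is a T x H temporal slice at image column x.
    """
    T = len(cuboid)
    H = len(cuboid[0])
    W = len(cuboid[0][0])

    # Spatial slices: one full frame per time step.
    xy = [[row[:] for row in frame] for frame in cuboid]

    # Temporal slices along x: fix a row y, stack that row over time.
    xt = [[cuboid[t][y][:] for t in range(T)] for y in range(H)]

    # Temporal slices along y: fix a column x, stack that column over time.
    yt = [[[cuboid[t][y][x] for y in range(H)] for t in range(T)]
          for x in range(W)]

    return xy, xt, yt


if __name__ == "__main__":
    # Toy cuboid with T=2, H=3, W=4; each value encodes its own (t, y, x).
    toy = [[[t * 100 + y * 10 + x for x in range(4)] for y in range(3)]
           for t in range(2)]
    xy, xt, yt = slice_cuboid(toy)
    print(len(xy), len(xt), len(yt))
```

In this layout each spatial slice holds appearance at one instant, while each temporal slice shows how one row or column of the feature map changes over time.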

Original language	English
Title of host publication	Proceedings - 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016
Publisher	IEEE Computer Society
Pages	5620-5628
Number of pages	9
ISBN (Electronic)	9781467388504
DOIs	https://doi.org/10.1109/CVPR.2016.606
Publication status	Published - Dec 9 2016
Externally published	Yes
Event	29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016 - Las Vegas, United States; Duration: Jun 26 2016 → Jul 1 2016

Publication series

Name	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume	2016-December
ISSN (Print)	1063-6919

Conference

Conference	29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016
Country/Territory	United States
City	Las Vegas
Period	6/26/16 → 7/1/16

Bibliographical note

Publisher Copyright:
© 2016 IEEE.

ASJC Scopus Subject Areas

Software
Computer Vision and Pattern Recognition

Cite this

Shao, J., Loy, C. C., Kang, K., & Wang, X. (2016). Slicing convolutional neural network for crowd video understanding. In Proceedings - 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016 (pp. 5620-5628). Article 7780975 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; Vol. 2016-December). IEEE Computer Society. https://doi.org/10.1109/CVPR.2016.606

@inproceedings{fc2363f068644c469486f757151836cc,

title = "Slicing convolutional neural network for crowd video understanding",

abstract = "Learning and capturing both appearance and dynamic representations are pivotal for crowd video understanding. Convolutional Neural Networks (CNNs) have shown remarkable potential in learning appearance representations from images. However, the learning of dynamic representations, and how they can be effectively combined with appearance features for video analysis, remains an open problem. In this study, we propose a novel spatio-temporal CNN, named Slicing CNN (S-CNN), based on the decomposition of 3D feature maps into 2D spatial- and 2D temporal-slice representations. The decomposition brings unique advantages: (1) the model is capable of capturing dynamics of different semantic units such as groups and objects, (2) it learns separated appearance and dynamic representations while keeping proper interactions between them, and (3) it exploits the selectiveness of spatial filters to discard irrelevant background clutter for crowd understanding. We demonstrate the effectiveness of the proposed S-CNN model on the WWW crowd video dataset for attribute recognition and observe significant performance improvements over the state-of-the-art methods (62.55\% versus 51.84\% [21]).",

author = "Jing Shao and Loy, {Chen Change} and Kai Kang and Xiaogang Wang",

note = "Publisher Copyright: {\textcopyright} 2016 IEEE.; 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016 ; Conference date: 26-06-2016 Through 01-07-2016",

year = "2016",

month = dec,

day = "9",

doi = "10.1109/CVPR.2016.606",

language = "English",

series = "Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition",

publisher = "IEEE Computer Society",

pages = "5620--5628",

booktitle = "Proceedings - 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016",

address = "United States",

}

Shao, J, Loy, CC, Kang, K & Wang, X 2016, Slicing convolutional neural network for crowd video understanding. in Proceedings - 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016., 7780975, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2016-December, IEEE Computer Society, pp. 5620-5628, 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, United States, 6/26/16. https://doi.org/10.1109/CVPR.2016.606

Slicing convolutional neural network for crowd video understanding. / Shao, Jing; Loy, Chen Change; Kang, Kai et al.
Proceedings - 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016. IEEE Computer Society, 2016. p. 5620-5628 7780975 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; Vol. 2016-December).


TY - GEN

T1 - Slicing convolutional neural network for crowd video understanding

AU - Shao, Jing

AU - Loy, Chen Change

AU - Kang, Kai

AU - Wang, Xiaogang

PY - 2016/12/9

Y1 - 2016/12/9

N2 - Learning and capturing both appearance and dynamic representations are pivotal for crowd video understanding. Convolutional Neural Networks (CNNs) have shown remarkable potential in learning appearance representations from images. However, the learning of dynamic representations, and how they can be effectively combined with appearance features for video analysis, remains an open problem. In this study, we propose a novel spatio-temporal CNN, named Slicing CNN (S-CNN), based on the decomposition of 3D feature maps into 2D spatial- and 2D temporal-slice representations. The decomposition brings unique advantages: (1) the model is capable of capturing dynamics of different semantic units such as groups and objects, (2) it learns separated appearance and dynamic representations while keeping proper interactions between them, and (3) it exploits the selectiveness of spatial filters to discard irrelevant background clutter for crowd understanding. We demonstrate the effectiveness of the proposed S-CNN model on the WWW crowd video dataset for attribute recognition and observe significant performance improvements over the state-of-the-art methods (62.55% versus 51.84% [21]).

AB - Learning and capturing both appearance and dynamic representations are pivotal for crowd video understanding. Convolutional Neural Networks (CNNs) have shown remarkable potential in learning appearance representations from images. However, the learning of dynamic representations, and how they can be effectively combined with appearance features for video analysis, remains an open problem. In this study, we propose a novel spatio-temporal CNN, named Slicing CNN (S-CNN), based on the decomposition of 3D feature maps into 2D spatial- and 2D temporal-slice representations. The decomposition brings unique advantages: (1) the model is capable of capturing dynamics of different semantic units such as groups and objects, (2) it learns separated appearance and dynamic representations while keeping proper interactions between them, and (3) it exploits the selectiveness of spatial filters to discard irrelevant background clutter for crowd understanding. We demonstrate the effectiveness of the proposed S-CNN model on the WWW crowd video dataset for attribute recognition and observe significant performance improvements over the state-of-the-art methods (62.55% versus 51.84% [21]).

UR - http://www.scopus.com/inward/record.url?scp=84986254030&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84986254030&partnerID=8YFLogxK

U2 - 10.1109/CVPR.2016.606

DO - 10.1109/CVPR.2016.606

M3 - Conference contribution

AN - SCOPUS:84986254030

T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

SP - 5620

EP - 5628

BT - Proceedings - 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016

PB - IEEE Computer Society

T2 - 29th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016

Y2 - 26 June 2016 through 1 July 2016

ER -
