Crowded Scene Understanding by Deeply Learned Volumetric Slices

Jing Shao*, Chen Change Loy, Kai Kang, Xiaogang Wang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

32 Citations (Scopus)

Abstract

Crowd video analysis is one of the hallmark tasks of crowded scene understanding. While image-based tasks have seen tremendous progress with the rise of convolutional neural networks (CNNs), video analysis has not yet attained the same level of success. In this paper, we introduce intuitive but effective temporal-aware crowd motion channels, obtained by uniformly slicing the video volume along different dimensions. Multiple CNN structures with different data-fusion strategies and weight-sharing schemes are proposed to learn spatial and temporal connectivity from these motion channels. To evaluate our deep models, we construct a new large-scale Who do What at someWhere crowd dataset with 10,000 videos from 8,257 crowded scenes, and build an attribute set with 94 attributes. Extensive experiments on crowd video attribute prediction demonstrate the effectiveness of our method over the state-of-the-art.
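The slicing idea in the abstract can be made concrete with a short sketch: given a video volume of shape (T, H, W), xy-slices are ordinary frames carrying appearance, while xt- and yt-slices cut across time and expose motion patterns. The snippet below is an illustrative reconstruction, not the authors' code; the function name volumetric_slices, the grayscale input, and the uniform sampling positions are assumptions.

```python
import numpy as np

def volumetric_slices(video, num_slices=3):
    """Sample xy-, xt-, and yt-slices uniformly from a video volume.

    video: ndarray of shape (T, H, W) -- grayscale frames stacked over time.
    Returns three lists of 2D slices (illustrative sketch, not the paper's code).
    """
    T, H, W = video.shape
    # xy-slices: whole frames at uniformly spaced time steps (appearance).
    xy = [video[t] for t in np.linspace(0, T - 1, num_slices, dtype=int)]
    # xt-slices: a fixed row traced over time, shape (T, W) (horizontal motion).
    xt = [video[:, y, :] for y in np.linspace(0, H - 1, num_slices, dtype=int)]
    # yt-slices: a fixed column traced over time, shape (T, H) (vertical motion).
    yt = [video[:, :, x] for x in np.linspace(0, W - 1, num_slices, dtype=int)]
    return xy, xt, yt

# Example: a random 16-frame clip of 64x64 frames.
clip = np.random.rand(16, 64, 64).astype(np.float32)
xy, xt, yt = volumetric_slices(clip)
print(xy[0].shape, xt[0].shape, yt[0].shape)  # (64, 64) (16, 64) (16, 64)
```

In the paper's setting, slices of this kind serve as the temporal-aware motion channels that are fed to the CNN branches; here they are simply returned for inspection.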

Original language: English
Article number: 7517290
Pages (from-to): 613-623
Number of pages: 11
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 27
Issue number: 3
DOIs
Publication status: Published - Mar 2017
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2016 IEEE.

ASJC Scopus Subject Areas

  • Media Technology
  • Electrical and Electronic Engineering

Keywords

  • Crowd database
  • crowded scene understanding
  • deep neural network
  • spatiotemporal features
  • video analysis
