Crowded Scene Understanding by Deeply Learned Volumetric Slices

Jing Shao*, Chen Change Loy, Kai Kang, Xiaogang Wang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

32 Citations (Scopus)

Abstract

Crowd video analysis is one of the hallmark tasks of crowded scene understanding. While image-based tasks have seen tremendous progress with the rise of convolutional neural networks (CNNs), video analysis has not yet attained the same level of success. In this paper, we introduce intuitive but effective temporal-aware crowd motion channels, obtained by uniformly slicing the video volume along different dimensions. Multiple CNN structures with different data-fusion strategies and weight-sharing schemes are proposed to learn spatial and temporal connectivity from these motion channels. To evaluate our deep models, we construct a new large-scale Who do What at someWhere crowd dataset with 10,000 videos from 8,257 crowded scenes, and build an attribute set with 94 attributes. Extensive experiments on crowd video attribute prediction demonstrate the effectiveness of our method over the state-of-the-art.
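The slicing idea in the abstract can be made concrete with a short sketch: given a video volume of shape (T, H, W), xy-slices are ordinary frames carrying appearance, while xt- and yt-slices cut across time and expose motion patterns. The snippet below is an illustrative reconstruction, not the authors' code; the function name volumetric_slices, the grayscale input, and the uniform sampling positions are assumptions.

```python
import numpy as np

def volumetric_slices(video, num_slices=3):
    """Sample xy-, xt-, and yt-slices uniformly from a video volume.

    video: ndarray of shape (T, H, W) -- grayscale frames stacked over time.
    Returns three lists of 2D slices (illustrative sketch, not the paper's code).
    """
    T, H, W = video.shape
    # xy-slices: whole frames at uniformly spaced time steps (appearance).
    xy = [video[t] for t in np.linspace(0, T - 1, num_slices, dtype=int)]
    # xt-slices: a fixed row traced over time, shape (T, W) (horizontal motion).
    xt = [video[:, y, :] for y in np.linspace(0, H - 1, num_slices, dtype=int)]
    # yt-slices: a fixed column traced over time, shape (T, H) (vertical motion).
    yt = [video[:, :, x] for x in np.linspace(0, W - 1, num_slices, dtype=int)]
    return xy, xt, yt

# Example: a random 16-frame clip of 64x64 frames.
clip = np.random.rand(16, 64, 64).astype(np.float32)
xy, xt, yt = volumetric_slices(clip)
print(xy[0].shape, xt[0].shape, yt[0].shape)  # (64, 64) (16, 64) (16, 64)
```

In the paper's setting, slices of this kind serve as the temporal-aware motion channels that are fed to the CNN branches; here they are simply returned for inspection.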

Original language: English
Article number: 7517290
Pages (from-to): 613-623
Number of pages: 11
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 27
Issue number: 3
DOIs
Publication status: Published - Mar 2017
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2016 IEEE.

ASJC Scopus Subject Areas

  • Media Technology
  • Electrical and Electronic Engineering

Keywords

  • Crowd database
  • crowded scene understanding
  • deep neural network
  • spatiotemporal features
  • video analysis
