Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation

Xiangtai Li; Wenwei Zhang; Jiangmiao Pang; Kai Chen; Guangliang Cheng; Yunhai Tong; Chen Change Loy

doi:10.1109/CVPR52688.2022.01828

Xiangtai Li, Wenwei Zhang, Jiangmiao Pang, Kai Chen, Guangliang Cheng^*, Yunhai Tong^*, Chen Change Loy

^*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

74 Citations (Scopus)

Abstract

This paper presents Video K-Net, a simple, strong, and unified framework for fully end-to-end video panoptic segmentation. The method is built upon K-Net, which unifies image segmentation via a group of learnable kernels. We observe that these learnable kernels from K-Net, which encode object appearances and contexts, can naturally associate identical instances across video frames. Motivated by this observation, Video K-Net learns to simultaneously segment and track 'things' and 'stuff' in a video with simple kernel-based appearance modeling and cross-temporal kernel interaction. Despite its simplicity, it achieves state-of-the-art video panoptic segmentation results on Cityscapes-VPS and KITTI-STEP without bells and whistles. In particular, on KITTI-STEP the method yields almost 12% relative improvement over previous methods. We also validate its generalization on video semantic segmentation, where it improves various baselines by 2% on the VSPW dataset. Moreover, we extend K-Net into a clip-level video framework for video instance segmentation, where we obtain 40.5% mAP with a ResNet50 backbone and 51.5% mAP with Swin-base on the YouTube-VIS 2019 validation set. We hope this simple yet effective method can serve as a new flexible baseline in video segmentation. Both code and models are released.
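The abstract's central observation, that per-instance kernels can associate identical instances across frames, can be illustrated with a small sketch. This is not the authors' implementation: it assumes each instance in a frame is summarized by one kernel vector, and it swaps in plain cosine similarity with greedy one-to-one matching as a stand-in for whatever learned association the paper uses. The function names, the `min_similarity` threshold, and the toy vectors are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two kernel vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def associate_kernels(
    prev_kernels: List[Sequence[float]],
    curr_kernels: List[Sequence[float]],
    min_similarity: float = 0.5,  # hypothetical threshold, not from the paper
) -> Dict[int, int]:
    """Greedily match current-frame kernels to previous-frame kernels.

    Returns {current_index: previous_index}. Current kernels that are
    absent from the result have no sufficiently similar predecessor and
    would start a new track.
    """
    # Score every (current, previous) pair, then visit pairs best-first.
    pairs = [
        (cosine_similarity(c, p), ci, pi)
        for ci, c in enumerate(curr_kernels)
        for pi, p in enumerate(prev_kernels)
    ]
    pairs.sort(key=lambda t: t[0], reverse=True)

    matches: Dict[int, int] = {}
    used_prev = set()
    for sim, ci, pi in pairs:
        if sim < min_similarity:
            break  # all remaining pairs are even less similar
        if ci in matches or pi in used_prev:
            continue  # enforce one-to-one matching
        matches[ci] = pi
        used_prev.add(pi)
    return matches


# Toy example: two instances in the previous frame, three in the current one.
prev_frame = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
curr_frame = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
matches = associate_kernels(prev_frame, curr_frame)
print(matches)
```

In this toy input, current kernel 0 is matched to previous kernel 1 and current kernel 1 to previous kernel 0, while current kernel 2 stays unmatched and would be treated as a new instance. A real system would more likely use learned embeddings and an optimal assignment rather than greedy matching.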

Original language	English
Title of host publication	Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Publisher	IEEE Computer Society
Pages	18825-18835
Number of pages	11
ISBN (Electronic)	9781665469463
DOIs	https://doi.org/10.1109/CVPR52688.2022.01828
Publication status	Published - 2022
Externally published	Yes
Event	2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 - New Orleans, United States Duration: Jun 19 2022 → Jun 24 2022

Publication series

Name	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume	2022-June
ISSN (Print)	1063-6919

Conference

Conference	2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Country/Territory	United States
City	New Orleans
Period	6/19/22 → 6/24/22

Bibliographical note

Publisher Copyright:
© 2022 IEEE.

ASJC Scopus Subject Areas

Software
Computer Vision and Pattern Recognition

Keywords

grouping and shape analysis
Scene analysis and understanding
Segmentation
Video analysis and understanding
Vision applications and systems

Access to Document

10.1109/CVPR52688.2022.01828

Cite this

Li, X., Zhang, W., Pang, J., Chen, K., Cheng, G., Tong, Y., & Loy, C. C. (2022). Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation. In Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 (pp. 18825-18835). (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; Vol. 2022-June). IEEE Computer Society. https://doi.org/10.1109/CVPR52688.2022.01828

@inproceedings{0f982bfd00fc47dc94f6f81857548b59,
title = "Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation",
abstract = "This paper presents Video K-Net, a simple, strong, and unified framework for fully end-to-end video panoptic segmentation. The method is built upon K-Net, which unifies image segmentation via a group of learnable kernels. We observe that these learnable kernels from K-Net, which encode object appearances and contexts, can naturally associate identical instances across video frames. Motivated by this observation, Video K-Net learns to simultaneously segment and track 'things' and 'stuff' in a video with simple kernel-based appearance modeling and cross-temporal kernel interaction. Despite its simplicity, it achieves state-of-the-art video panoptic segmentation results on Cityscapes-VPS and KITTI-STEP without bells and whistles. In particular, on KITTI-STEP the method yields almost 12\% relative improvement over previous methods. We also validate its generalization on video semantic segmentation, where it improves various baselines by 2\% on the VSPW dataset. Moreover, we extend K-Net into a clip-level video framework for video instance segmentation, where we obtain 40.5\% mAP with a ResNet50 backbone and 51.5\% mAP with Swin-base on the YouTube-VIS 2019 validation set. We hope this simple yet effective method can serve as a new flexible baseline in video segmentation. Both code and models are released.",
keywords = "grouping and shape analysis, Scene analysis and understanding, Segmentation, Video analysis and understanding, Vision applications and systems",
author = "Xiangtai Li and Wenwei Zhang and Jiangmiao Pang and Kai Chen and Guangliang Cheng and Yunhai Tong and Loy, \{Chen Change\}",
note = "Publisher Copyright: {\textcopyright} 2022 IEEE.; 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 ; Conference date: 19-06-2022 Through 24-06-2022",
year = "2022",
doi = "10.1109/CVPR52688.2022.01828",
language = "English",
series = "Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition",
publisher = "IEEE Computer Society",
pages = "18825--18835",
booktitle = "Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022",
address = "United States",
}

Li, X, Zhang, W, Pang, J, Chen, K, Cheng, G, Tong, Y & Loy, CC 2022, Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation. in Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2022-June, IEEE Computer Society, pp. 18825-18835, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, United States, 6/19/22. https://doi.org/10.1109/CVPR52688.2022.01828

Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation. / Li, Xiangtai; Zhang, Wenwei; Pang, Jiangmiao et al.
Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022. IEEE Computer Society, 2022. p. 18825-18835 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; Vol. 2022-June).

TY - GEN

T1 - Video K-Net

T2 - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022

AU - Li, Xiangtai

AU - Zhang, Wenwei

AU - Pang, Jiangmiao

AU - Chen, Kai

AU - Cheng, Guangliang

AU - Tong, Yunhai

AU - Loy, Chen Change

PY - 2022

Y1 - 2022

N2 - This paper presents Video K-Net, a simple, strong, and unified framework for fully end-to-end video panoptic segmentation. The method is built upon K-Net, which unifies image segmentation via a group of learnable kernels. We observe that these learnable kernels from K-Net, which encode object appearances and contexts, can naturally associate identical instances across video frames. Motivated by this observation, Video K-Net learns to simultaneously segment and track 'things' and 'stuff' in a video with simple kernel-based appearance modeling and cross-temporal kernel interaction. Despite its simplicity, it achieves state-of-the-art video panoptic segmentation results on Cityscapes-VPS and KITTI-STEP without bells and whistles. In particular, on KITTI-STEP the method yields almost 12% relative improvement over previous methods. We also validate its generalization on video semantic segmentation, where it improves various baselines by 2% on the VSPW dataset. Moreover, we extend K-Net into a clip-level video framework for video instance segmentation, where we obtain 40.5% mAP with a ResNet50 backbone and 51.5% mAP with Swin-base on the YouTube-VIS 2019 validation set. We hope this simple yet effective method can serve as a new flexible baseline in video segmentation. Both code and models are released.

KW - grouping and shape analysis

KW - Scene analysis and understanding

KW - Segmentation

KW - Video analysis and understanding

KW - Vision applications and systems

UR - http://www.scopus.com/inward/record.url?scp=85138705881&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85138705881&partnerID=8YFLogxK

U2 - 10.1109/CVPR52688.2022.01828

DO - 10.1109/CVPR52688.2022.01828

M3 - Conference contribution

AN - SCOPUS:85138705881

T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

SP - 18825

EP - 18835

BT - Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022

PB - IEEE Computer Society

Y2 - 19 June 2022 through 24 June 2022

ER -

Li X, Zhang W, Pang J, Chen K, Cheng G, Tong Y et al. Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation. In Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022. IEEE Computer Society. 2022. p. 18825-18835. (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition). doi: 10.1109/CVPR52688.2022.01828
