Mix-and-match tuning for self-supervised semantic segmentation

Xiaohang Zhan; Ziwei Liu; Ping Luo; Xiaoou Tang; Chen Change Loy

Chinese University of Hong Kong

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

27 Citations (Scopus)

Abstract

Deep convolutional networks for semantic image segmentation typically require large-scale labeled data, e.g., ImageNet and MS COCO, for network pre-training. To reduce annotation efforts, self-supervised semantic segmentation was recently proposed to pre-train a network without any human-provided labels. The key to this new form of learning is to design a proxy task (e.g., image colorization) from which a discriminative loss can be formulated on unlabeled data. Many proxy tasks, however, lack the critical supervision signals that could induce a discriminative representation for the target image segmentation task. Thus, self-supervision's performance is still far from that of supervised pre-training. In this study, we overcome this limitation by incorporating a 'mix-and-match' (M&M) tuning stage into the self-supervision pipeline. The proposed approach is readily pluggable into many self-supervision methods and does not use more annotated samples than the original process. Yet it is capable of boosting performance on the target image segmentation task beyond that of the fully supervised pre-trained counterpart. The improvement is made possible by better harnessing the limited pixel-wise annotations in the target dataset. Specifically, we first introduce the 'mix' stage, which sparsely samples and mixes patches from the target set to reflect the rich and diverse local patch statistics of target images. A 'match' stage then forms a class-wise connected graph, which can be used to derive a strong triplet-based discriminative loss for fine-tuning the network. Our paradigm follows the standard practice in existing self-supervised studies and requires no extra data or labels. With the proposed M&M approach, for the first time, a self-supervision method can achieve performance comparable to or even better than that of its ImageNet pre-trained counterpart on both the PASCAL VOC2012 and CityScapes datasets.
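The triplet-based loss mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the standard hinge-style triplet loss with Euclidean distance, a margin of 1.0, random positive/negative selection within a batch of labeled patch embeddings, and the hypothetical helper names `triplet_loss`, `sample_triplets`, and `mean_triplet_loss`. It is not the authors' implementation.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic hinge triplet loss: max(0, d(a, p) - d(a, n) + margin).

    Assumption: this textbook form stands in for the paper's loss.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def sample_triplets(labels, rng):
    """For each patch, pick one same-class positive and one other-class negative.

    Assumption: uniform random selection; the paper's graph-based matching
    is not reproduced here.
    """
    triplets = []
    for a, label in enumerate(labels):
        positives = [i for i, l in enumerate(labels) if l == label and i != a]
        negatives = [i for i, l in enumerate(labels) if l != label]
        if positives and negatives:
            triplets.append((a, rng.choice(positives), rng.choice(negatives)))
    return triplets


def mean_triplet_loss(embeddings, labels, margin=1.0, seed=0):
    """Average triplet loss over the triplets sampled from one mixed batch."""
    rng = random.Random(seed)
    triplets = sample_triplets(labels, rng)
    if not triplets:
        return 0.0
    total = sum(
        triplet_loss(embeddings[a], embeddings[p], embeddings[n], margin)
        for a, p, n in triplets
    )
    return total / len(triplets)


# Toy batch: two patches per class, with the classes far apart in embedding space.
patch_embeddings = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]]
patch_labels = [0, 0, 1, 1]
batch_loss = mean_triplet_loss(patch_embeddings, patch_labels)
```

In this toy batch the classes are already well separated, so every triplet satisfies the margin and the averaged loss is zero; embeddings of different classes that lie close together would produce a positive loss that fine-tuning would then reduce.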

Original language	English
Title of host publication	32nd AAAI Conference on Artificial Intelligence, AAAI 2018
Publisher	AAAI Press
Pages	7534-7541
Number of pages	8
ISBN (Electronic)	9781577358008
Publication status	Published - 2018
Externally published	Yes
Event	32nd AAAI Conference on Artificial Intelligence, AAAI 2018 - New Orleans, United States; Duration: Feb 2 2018 → Feb 7 2018

Publication series

Name	32nd AAAI Conference on Artificial Intelligence, AAAI 2018

Conference

Conference	32nd AAAI Conference on Artificial Intelligence, AAAI 2018
Country/Territory	United States
City	New Orleans
Period	2/2/18 → 2/7/18

Bibliographical note

Publisher Copyright:
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

ASJC Scopus Subject Areas

Artificial Intelligence

Cite this

@inproceedings{bd6b82b58fba49f19cc2da3f3062b25d,

title = "Mix-and-match tuning for self-supervised semantic segmentation",

abstract = "Deep convolutional networks for semantic image segmentation typically require large-scale labeled data, e.g., ImageNet and MS COCO, for network pre-training. To reduce annotation efforts, self-supervised semantic segmentation is recently proposed to pre-train a network without any human-provided labels. The key of this new form of learning is to design a proxy task (e.g., image colorization), from which a discriminative loss can be formulated on unlabeled data. Many proxy tasks, however, lack the critical supervision signals that could induce discriminative representation for the target image segmentation task. Thus self-supervision's performance is still far from that of supervised pre-training. In this study, we overcome this limitation by incorporating a 'mix-and-match' (M\&M) tuning stage in the self-supervision pipeline. The proposed approach is readily pluggable to many self-supervision methods and does not use more annotated samples than the original process. Yet, it is capable of boosting the performance of target image segmentation task to surpass fully-supervised pre-trained counterpart. The improvement is made possible by better harnessing the limited pixel-wise annotations in the target dataset. Specifically, we first introduce the 'mix' stage, which sparsely samples and mixes patches from the target set to reflect rich and diverse local patch statistics of target images. A 'match' stage then forms a class-wise connected graph, which can be used to derive a strong triplet-based discriminative loss for fine-tuning the network. Our paradigm follows the standard practice in existing self-supervised studies and no extra data or label is required. With the proposed M\&M approach, for the first time, a self-supervision method can achieve comparable or even better performance compared to its ImageNet pre-trained counterpart on both PASCAL VOC2012 dataset and CityScapes dataset.",

author = "Xiaohang Zhan and Ziwei Liu and Ping Luo and Xiaoou Tang and Loy, Chen Change",

note = "Publisher Copyright: Copyright {\textcopyright} 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.; 32nd AAAI Conference on Artificial Intelligence, AAAI 2018 ; Conference date: 02-02-2018 Through 07-02-2018",

year = "2018",

language = "English",

series = "32nd AAAI Conference on Artificial Intelligence, AAAI 2018",

publisher = "AAAI Press",

pages = "7534--7541",

booktitle = "32nd AAAI Conference on Artificial Intelligence, AAAI 2018",

}

TY - GEN

T1 - Mix-and-match tuning for self-supervised semantic segmentation

AU - Zhan, Xiaohang

AU - Liu, Ziwei

AU - Luo, Ping

AU - Tang, Xiaoou

AU - Loy, Chen Change

PY - 2018

Y1 - 2018

N2 - Deep convolutional networks for semantic image segmentation typically require large-scale labeled data, e.g., ImageNet and MS COCO, for network pre-training. To reduce annotation efforts, self-supervised semantic segmentation is recently proposed to pre-train a network without any human-provided labels. The key of this new form of learning is to design a proxy task (e.g., image colorization), from which a discriminative loss can be formulated on unlabeled data. Many proxy tasks, however, lack the critical supervision signals that could induce discriminative representation for the target image segmentation task. Thus self-supervision's performance is still far from that of supervised pre-training. In this study, we overcome this limitation by incorporating a 'mix-and-match' (M&M) tuning stage in the self-supervision pipeline. The proposed approach is readily pluggable to many self-supervision methods and does not use more annotated samples than the original process. Yet, it is capable of boosting the performance of target image segmentation task to surpass fully-supervised pre-trained counterpart. The improvement is made possible by better harnessing the limited pixel-wise annotations in the target dataset. Specifically, we first introduce the 'mix' stage, which sparsely samples and mixes patches from the target set to reflect rich and diverse local patch statistics of target images. A 'match' stage then forms a class-wise connected graph, which can be used to derive a strong triplet-based discriminative loss for fine-tuning the network. Our paradigm follows the standard practice in existing self-supervised studies and no extra data or label is required. With the proposed M&M approach, for the first time, a self-supervision method can achieve comparable or even better performance compared to its ImageNet pre-trained counterpart on both PASCAL VOC2012 dataset and CityScapes dataset.

AB - Deep convolutional networks for semantic image segmentation typically require large-scale labeled data, e.g., ImageNet and MS COCO, for network pre-training. To reduce annotation efforts, self-supervised semantic segmentation is recently proposed to pre-train a network without any human-provided labels. The key of this new form of learning is to design a proxy task (e.g., image colorization), from which a discriminative loss can be formulated on unlabeled data. Many proxy tasks, however, lack the critical supervision signals that could induce discriminative representation for the target image segmentation task. Thus self-supervision's performance is still far from that of supervised pre-training. In this study, we overcome this limitation by incorporating a 'mix-and-match' (M&M) tuning stage in the self-supervision pipeline. The proposed approach is readily pluggable to many self-supervision methods and does not use more annotated samples than the original process. Yet, it is capable of boosting the performance of target image segmentation task to surpass fully-supervised pre-trained counterpart. The improvement is made possible by better harnessing the limited pixel-wise annotations in the target dataset. Specifically, we first introduce the 'mix' stage, which sparsely samples and mixes patches from the target set to reflect rich and diverse local patch statistics of target images. A 'match' stage then forms a class-wise connected graph, which can be used to derive a strong triplet-based discriminative loss for fine-tuning the network. Our paradigm follows the standard practice in existing self-supervised studies and no extra data or label is required. With the proposed M&M approach, for the first time, a self-supervision method can achieve comparable or even better performance compared to its ImageNet pre-trained counterpart on both PASCAL VOC2012 dataset and CityScapes dataset.

UR - http://www.scopus.com/inward/record.url?scp=85060496730&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85060496730&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85060496730

T3 - 32nd AAAI Conference on Artificial Intelligence, AAAI 2018

SP - 7534

EP - 7541

BT - 32nd AAAI Conference on Artificial Intelligence, AAAI 2018

PB - AAAI Press

T2 - 32nd AAAI Conference on Artificial Intelligence, AAAI 2018

Y2 - 2 February 2018 through 7 February 2018

ER -
