Extract Free Dense Labels from CLIP

Chong Zhou, Chen Change Loy^*, Bo Dai

^*Corresponding author for this work

doi:10.1007/978-3-031-19815-1_40

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

257 Citations (Scopus)

Abstract

Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition. Many recent studies leverage the pre-trained CLIP models for image-level classification and manipulation. In this paper, we wish to examine the intrinsic potential of CLIP for pixel-level dense prediction, specifically in semantic segmentation. To this end, with minimal modification, we show that MaskCLIP yields compelling segmentation results on open concepts across various datasets in the absence of annotations and fine-tuning. By adding pseudo labeling and self-training, MaskCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins, e.g., mIoUs of unseen classes on PASCAL VOC/PASCAL Context/COCO Stuff are improved from 35.6/20.7/30.3 to 86.1/66.7/54.7. We also test the robustness of MaskCLIP under input corruption and evaluate its capability in discriminating fine-grained objects and novel concepts. Our finding suggests that MaskCLIP can serve as a new reliable source of supervision for dense prediction tasks to achieve annotation-free segmentation. Source code is available here.

Original language	English
Title of host publication	Computer Vision – ECCV 2022 - 17th European Conference, Proceedings
Editors	Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, Tal Hassner
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	696-712
Number of pages	17
ISBN (Print)	9783031198144
DOIs	https://doi.org/10.1007/978-3-031-19815-1_40
Publication status	Published - 2022
Externally published	Yes
Event	17th European Conference on Computer Vision, ECCV 2022 - Tel Aviv, Israel
Duration	Oct 23 2022 → Oct 27 2022

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	13688 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	17th European Conference on Computer Vision, ECCV 2022
Country/Territory	Israel
City	Tel Aviv
Period	10/23/22 → 10/27/22

Bibliographical note

Publisher Copyright:
© 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.

ASJC Scopus Subject Areas

Theoretical Computer Science
General Computer Science

Cite this

Zhou, C., Loy, C. C., & Dai, B. (2022). Extract Free Dense Labels from CLIP. In S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, & T. Hassner (Eds.), Computer Vision – ECCV 2022 - 17th European Conference, Proceedings (pp. 696-712). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 13688 LNCS). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-19815-1_40

Zhou, Chong ; Loy, Chen Change ; Dai, Bo. / Extract Free Dense Labels from CLIP. Computer Vision – ECCV 2022 - 17th European Conference, Proceedings. editor / Shai Avidan ; Gabriel Brostow ; Moustapha Cissé ; Giovanni Maria Farinella ; Tal Hassner. Springer Science and Business Media Deutschland GmbH, 2022. pp. 696-712 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{7eef3f05cf554a6da296d46ed6d26aef,
title = "Extract Free Dense Labels from CLIP",
abstract = "Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition. Many recent studies leverage the pre-trained CLIP models for image-level classification and manipulation. In this paper, we wish to examine the intrinsic potential of CLIP for pixel-level dense prediction, specifically in semantic segmentation. To this end, with minimal modification, we show that MaskCLIP yields compelling segmentation results on open concepts across various datasets in the absence of annotations and fine-tuning. By adding pseudo labeling and self-training, MaskCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins, e.g., mIoUs of unseen classes on PASCAL VOC/PASCAL Context/COCO Stuff are improved from 35.6/20.7/30.3 to 86.1/66.7/54.7. We also test the robustness of MaskCLIP under input corruption and evaluate its capability in discriminating fine-grained objects and novel concepts. Our finding suggests that MaskCLIP can serve as a new reliable source of supervision for dense prediction tasks to achieve annotation-free segmentation. Source code is available here.",
author = "Chong Zhou and Loy, {Chen Change} and Bo Dai",
note = "Publisher Copyright: {\textcopyright} 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.; 17th European Conference on Computer Vision, ECCV 2022 ; Conference date: 23-10-2022 Through 27-10-2022",
year = "2022",
doi = "10.1007/978-3-031-19815-1_40",
language = "English",
isbn = "9783031198144",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Science and Business Media Deutschland GmbH",
pages = "696--712",
editor = "Shai Avidan and Gabriel Brostow and Moustapha Ciss{\'e} and Farinella, {Giovanni Maria} and Tal Hassner",
booktitle = "Computer Vision – ECCV 2022 - 17th European Conference, Proceedings",
address = "Germany",
}

Zhou, C, Loy, CC & Dai, B 2022, Extract Free Dense Labels from CLIP. in S Avidan, G Brostow, M Cissé, GM Farinella & T Hassner (eds), Computer Vision – ECCV 2022 - 17th European Conference, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 13688 LNCS, Springer Science and Business Media Deutschland GmbH, pp. 696-712, 17th European Conference on Computer Vision, ECCV 2022, Tel Aviv, Israel, 10/23/22. https://doi.org/10.1007/978-3-031-19815-1_40

Extract Free Dense Labels from CLIP. / Zhou, Chong; Loy, Chen Change; Dai, Bo.
Computer Vision – ECCV 2022 - 17th European Conference, Proceedings. ed. / Shai Avidan; Gabriel Brostow; Moustapha Cissé; Giovanni Maria Farinella; Tal Hassner. Springer Science and Business Media Deutschland GmbH, 2022. p. 696-712 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 13688 LNCS).

TY  - GEN
T1  - Extract Free Dense Labels from CLIP
AU  - Zhou, Chong
AU  - Loy, Chen Change
AU  - Dai, Bo
PY  - 2022
Y1  - 2022
AB  - Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition. Many recent studies leverage the pre-trained CLIP models for image-level classification and manipulation. In this paper, we wish to examine the intrinsic potential of CLIP for pixel-level dense prediction, specifically in semantic segmentation. To this end, with minimal modification, we show that MaskCLIP yields compelling segmentation results on open concepts across various datasets in the absence of annotations and fine-tuning. By adding pseudo labeling and self-training, MaskCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins, e.g., mIoUs of unseen classes on PASCAL VOC/PASCAL Context/COCO Stuff are improved from 35.6/20.7/30.3 to 86.1/66.7/54.7. We also test the robustness of MaskCLIP under input corruption and evaluate its capability in discriminating fine-grained objects and novel concepts. Our finding suggests that MaskCLIP can serve as a new reliable source of supervision for dense prediction tasks to achieve annotation-free segmentation. Source code is available here.
UR  - http://www.scopus.com/inward/record.url?scp=85142706450&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85142706450&partnerID=8YFLogxK
U2  - 10.1007/978-3-031-19815-1_40
DO  - 10.1007/978-3-031-19815-1_40
M3  - Conference contribution
AN  - SCOPUS:85142706450
SN  - 9783031198144
T3  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP  - 696
EP  - 712
BT  - Computer Vision – ECCV 2022 - 17th European Conference, Proceedings
A2  - Avidan, Shai
A2  - Brostow, Gabriel
A2  - Cissé, Moustapha
A2  - Farinella, Giovanni Maria
A2  - Hassner, Tal
PB  - Springer Science and Business Media Deutschland GmbH
T2  - 17th European Conference on Computer Vision, ECCV 2022
Y2  - 23 October 2022 through 27 October 2022
ER  - 

Zhou C, Loy CC, Dai B. Extract Free Dense Labels from CLIP. In Avidan S, Brostow G, Cissé M, Farinella GM, Hassner T, editors, Computer Vision – ECCV 2022 - 17th European Conference, Proceedings. Springer Science and Business Media Deutschland GmbH. 2022. p. 696-712. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-031-19815-1_40