Open-Vocabulary DETR with Conditional Matching

Yuhang Zang; Wei Li; Kaiyang Zhou; Chen Huang; Chen Change Loy

doi:10.1007/978-3-031-20077-9_7

Open-Vocabulary DETR with Conditional Matching

Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, Chen Change Loy^*

^*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

119 Citations (Scopus)

Abstract

Open-vocabulary object detection, which is concerned with the problem of detecting novel objects guided by natural language, has gained increasing attention from the community. Ideally, we would like to extend an open-vocabulary detector such that it can produce bounding box predictions based on user inputs in form of either natural language or exemplar image. This offers great flexibility and user experience for human-computer interaction. To this end, we propose a novel open-vocabulary detector based on DETR—hence the name OV-DETR—which, once trained, can detect any object given its class name or an exemplar image. The biggest challenge of turning DETR into an open-vocabulary detector is that it is impossible to calculate the classification cost matrix of novel classes without access to their labeled images. To overcome this challenge, we formulate the learning objective as a binary matching one between input queries (class name or exemplar image) and the corresponding objects, which learns useful correspondence to generalize to unseen queries during testing. For training, we choose to condition the Transformer decoder on the input embeddings obtained from a pre-trained vision-language model like CLIP, in order to enable matching for both text and image queries. With extensive experiments on LVIS and COCO datasets, we demonstrate that our OV-DETR—the first end-to-end Transformer-based open-vocabulary detector—achieves non-trivial improvements over current state of the arts. Code is available at https://github.com/yuhangzang/OV-DETR.

Original language	English
Title of host publication	Computer Vision – ECCV 2022 - 17th European Conference, Proceedings
Editors	Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, Tal Hassner
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	106-122
Number of pages	17
ISBN (Print)	9783031200762
DOIs	https://doi.org/10.1007/978-3-031-20077-9_7
Publication status	Published - 2022
Externally published	Yes
Event	17th European Conference on Computer Vision, ECCV 2022 - Tel Aviv, Israel Duration: Oct 23 2022 → Oct 27 2022

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	13669 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	17th European Conference on Computer Vision, ECCV 2022
Country/Territory	Israel
City	Tel Aviv
Period	10/23/22 → 10/27/22

Bibliographical note

Publisher Copyright:
© 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.

ASJC Scopus Subject Areas

Theoretical Computer Science
General Computer Science

Access to Document

10.1007/978-3-031-20077-9_7

Cite this

Zang, Y., Li, W., Zhou, K., Huang, C., & Loy, C. C. (2022). Open-Vocabulary DETR with Conditional Matching. In S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, & T. Hassner (Eds.), Computer Vision – ECCV 2022 - 17th European Conference, Proceedings (pp. 106-122). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 13669 LNCS). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-20077-9_7

Zang, Yuhang ; Li, Wei ; Zhou, Kaiyang et al. / Open-Vocabulary DETR with Conditional Matching. Computer Vision – ECCV 2022 - 17th European Conference, Proceedings. editor / Shai Avidan ; Gabriel Brostow ; Moustapha Cissé ; Giovanni Maria Farinella ; Tal Hassner. Springer Science and Business Media Deutschland GmbH, 2022. pp. 106-122 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{159908853ece4f4294209fd4ac615428,

title = "Open-Vocabulary DETR with Conditional Matching",

abstract = "Open-vocabulary object detection, which is concerned with the problem of detecting novel objects guided by natural language, has gained increasing attention from the community. Ideally, we would like to extend an open-vocabulary detector such that it can produce bounding box predictions based on user inputs in form of either natural language or exemplar image. This offers great flexibility and user experience for human-computer interaction. To this end, we propose a novel open-vocabulary detector based on DETR—hence the name OV-DETR—which, once trained, can detect any object given its class name or an exemplar image. The biggest challenge of turning DETR into an open-vocabulary detector is that it is impossible to calculate the classification cost matrix of novel classes without access to their labeled images. To overcome this challenge, we formulate the learning objective as a binary matching one between input queries (class name or exemplar image) and the corresponding objects, which learns useful correspondence to generalize to unseen queries during testing. For training, we choose to condition the Transformer decoder on the input embeddings obtained from a pre-trained vision-language model like CLIP, in order to enable matching for both text and image queries. With extensive experiments on LVIS and COCO datasets, we demonstrate that our OV-DETR—the first end-to-end Transformer-based open-vocabulary detector—achieves non-trivial improvements over current state of the arts. Code is available at https://github.com/yuhangzang/OV-DETR.",

author = "Yuhang Zang and Wei Li and Kaiyang Zhou and Chen Huang and Loy, \{Chen Change\}",

note = "Publisher Copyright: {\textcopyright} 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.; 17th European Conference on Computer Vision, ECCV 2022 ; Conference date: 23-10-2022 Through 27-10-2022",

year = "2022",

doi = "10.1007/978-3-031-20077-9\_7",

language = "English",

isbn = "9783031200762",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Science and Business Media Deutschland GmbH",

pages = "106--122",

editor = "Shai Avidan and Gabriel Brostow and Moustapha Ciss{\'e} and Farinella, \{Giovanni Maria\} and Tal Hassner",

booktitle = "Computer Vision – ECCV 2022 - 17th European Conference, Proceedings",

address = "Germany",

}

Zang, Y, Li, W, Zhou, K, Huang, C & Loy, CC 2022, Open-Vocabulary DETR with Conditional Matching. in S Avidan, G Brostow, M Cissé, GM Farinella & T Hassner (eds), Computer Vision – ECCV 2022 - 17th European Conference, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 13669 LNCS, Springer Science and Business Media Deutschland GmbH, pp. 106-122, 17th European Conference on Computer Vision, ECCV 2022, Tel Aviv, Israel, 10/23/22. https://doi.org/10.1007/978-3-031-20077-9_7

Open-Vocabulary DETR with Conditional Matching. / Zang, Yuhang; Li, Wei; Zhou, Kaiyang et al.
Computer Vision – ECCV 2022 - 17th European Conference, Proceedings. ed. / Shai Avidan; Gabriel Brostow; Moustapha Cissé; Giovanni Maria Farinella; Tal Hassner. Springer Science and Business Media Deutschland GmbH, 2022. p. 106-122 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 13669 LNCS).

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

TY - GEN

T1 - Open-Vocabulary DETR with Conditional Matching

AU - Zang, Yuhang

AU - Li, Wei

AU - Zhou, Kaiyang

AU - Huang, Chen

AU - Loy, Chen Change

PY - 2022

Y1 - 2022

N2 - Open-vocabulary object detection, which is concerned with the problem of detecting novel objects guided by natural language, has gained increasing attention from the community. Ideally, we would like to extend an open-vocabulary detector such that it can produce bounding box predictions based on user inputs in form of either natural language or exemplar image. This offers great flexibility and user experience for human-computer interaction. To this end, we propose a novel open-vocabulary detector based on DETR—hence the name OV-DETR—which, once trained, can detect any object given its class name or an exemplar image. The biggest challenge of turning DETR into an open-vocabulary detector is that it is impossible to calculate the classification cost matrix of novel classes without access to their labeled images. To overcome this challenge, we formulate the learning objective as a binary matching one between input queries (class name or exemplar image) and the corresponding objects, which learns useful correspondence to generalize to unseen queries during testing. For training, we choose to condition the Transformer decoder on the input embeddings obtained from a pre-trained vision-language model like CLIP, in order to enable matching for both text and image queries. With extensive experiments on LVIS and COCO datasets, we demonstrate that our OV-DETR—the first end-to-end Transformer-based open-vocabulary detector—achieves non-trivial improvements over current state of the arts. Code is available at https://github.com/yuhangzang/OV-DETR.

AB - Open-vocabulary object detection, which is concerned with the problem of detecting novel objects guided by natural language, has gained increasing attention from the community. Ideally, we would like to extend an open-vocabulary detector such that it can produce bounding box predictions based on user inputs in form of either natural language or exemplar image. This offers great flexibility and user experience for human-computer interaction. To this end, we propose a novel open-vocabulary detector based on DETR—hence the name OV-DETR—which, once trained, can detect any object given its class name or an exemplar image. The biggest challenge of turning DETR into an open-vocabulary detector is that it is impossible to calculate the classification cost matrix of novel classes without access to their labeled images. To overcome this challenge, we formulate the learning objective as a binary matching one between input queries (class name or exemplar image) and the corresponding objects, which learns useful correspondence to generalize to unseen queries during testing. For training, we choose to condition the Transformer decoder on the input embeddings obtained from a pre-trained vision-language model like CLIP, in order to enable matching for both text and image queries. With extensive experiments on LVIS and COCO datasets, we demonstrate that our OV-DETR—the first end-to-end Transformer-based open-vocabulary detector—achieves non-trivial improvements over current state of the arts. Code is available at https://github.com/yuhangzang/OV-DETR.

UR - http://www.scopus.com/inward/record.url?scp=85142695185&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85142695185&partnerID=8YFLogxK

U2 - 10.1007/978-3-031-20077-9_7

DO - 10.1007/978-3-031-20077-9_7

M3 - Conference contribution

AN - SCOPUS:85142695185

SN - 9783031200762

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 106

EP - 122

BT - Computer Vision – ECCV 2022 - 17th European Conference, Proceedings

A2 - Avidan, Shai

A2 - Brostow, Gabriel

A2 - Cissé, Moustapha

A2 - Farinella, Giovanni Maria

A2 - Hassner, Tal

PB - Springer Science and Business Media Deutschland GmbH

T2 - 17th European Conference on Computer Vision, ECCV 2022

Y2 - 23 October 2022 through 27 October 2022

ER -

Zang Y, Li W, Zhou K, Huang C, Loy CC. Open-Vocabulary DETR with Conditional Matching. In Avidan S, Brostow G, Cissé M, Farinella GM, Hassner T, editors, Computer Vision – ECCV 2022 - 17th European Conference, Proceedings. Springer Science and Business Media Deutschland GmbH. 2022. p. 106-122. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-031-20077-9_7