Aligning Bag of Regions for Open-Vocabulary Object Detection

Size Wu, Wenwei Zhang, Sheng Jin, Wentao Liu, Chen Change Loy*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

94 Citations (Scopus)

Abstract

Pre-trained vision-language models (VLMs) learn to align vision and language representations on large-scale datasets, where each image-text pair usually contains a bag of semantic concepts. However, existing open-vocabulary object detectors only align region embeddings individually with the corresponding features extracted from the VLMs. Such a design leaves the compositional structure of semantic concepts in a scene under-exploited, although the structure may be implicitly learned by the VLMs. In this work, we propose to align the embedding of a bag of regions, going beyond individual regions. The proposed method groups contextually interrelated regions into a bag. The embeddings of the regions in a bag are treated as the embeddings of words in a sentence and are sent to the text encoder of a VLM to obtain the bag-of-regions embedding, which is trained to align with the corresponding features extracted by a frozen VLM. Applied to the commonly used Faster R-CNN, our approach surpasses the previous best results by 4.6 box AP50 and 2.8 mask AP on novel categories of the open-vocabulary COCO and LVIS benchmarks, respectively. Code and models are available at https://github.com/wusize/ovdet.
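
The alignment idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that). Everything here is an assumption made for illustration: `encode_bag` is a hypothetical stand-in that mean-pools region embeddings where the method would use a VLM text encoder, the teacher vectors stand in for features from a frozen VLM, and the InfoNCE-style loss is one common choice for alignment that the abstract itself does not specify.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def encode_bag(region_embeddings):
    """Hypothetical stand-in for the text encoder: treats the region
    embeddings of one bag like word embeddings of one sentence, then
    mean-pools and normalizes them into a single bag embedding."""
    count = len(region_embeddings)
    dim = len(region_embeddings[0])
    pooled = [sum(e[i] for e in region_embeddings) / count for i in range(dim)]
    return l2_normalize(pooled)


def alignment_loss(student_bags, teacher_bags, temperature=0.1):
    """InfoNCE-style loss (an assumption, not taken from the abstract):
    the i-th student bag embedding should be most similar to the i-th
    teacher embedding among all teacher embeddings."""
    total = 0.0
    for i, student in enumerate(student_bags):
        logits = [dot(student, teacher) / temperature for teacher in teacher_bags]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_sum - logits[i]
    return total / len(student_bags)


# Two toy bags of two regions each, with made-up 3-d region embeddings.
bags = [
    [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]],
    [[0.1, 0.8, 0.0], [0.0, 1.0, 0.2]],
]
# Made-up "frozen VLM" features for the same two bags.
teacher = [l2_normalize([1.0, 0.0, 0.0]), l2_normalize([0.0, 1.0, 0.0])]
student = [encode_bag(bag) for bag in bags]
loss = alignment_loss(student, teacher)
```

In this sketch the loss is small when each bag embedding points toward its own teacher feature and grows when the pairing is wrong, which is the behavior an alignment objective of this kind is meant to have.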

Original language: English
Title of host publication: Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
Publisher: IEEE Computer Society
Pages: 15254-15264
Number of pages: 11
ISBN (Electronic): 9798350301298
DOIs
Publication status: Published - 2023
Externally published: Yes
Event: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023 - Vancouver, Canada
Duration: Jun 18 2023 to Jun 22 2023

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume: 2023-June
ISSN (Print): 1063-6919

Conference

Conference: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
Country/Territory: Canada
City: Vancouver
Period: 6/18/23 to 6/22/23

Bibliographical note

Publisher Copyright:
© 2023 IEEE.

ASJC Scopus Subject Areas

  • Software
  • Computer Vision and Pattern Recognition

Keywords

  • detection
  • Recognition: Categorization
  • retrieval
