Aligning Bag of Regions for Open-Vocabulary Object Detection

Size Wu; Wenwei Zhang; Sheng Jin; Wentao Liu; Chen Change Loy

doi:10.1109/CVPR52729.2023.01464

Aligning Bag of Regions for Open-Vocabulary Object Detection

Size Wu, Wenwei Zhang, Sheng Jin, Wentao Liu, Chen Change Loy^*

^*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

94 Citations (Scopus)

Abstract

Pre-trained vision-language models (VLMs) learn to align vision and language representations on large-scale datasets, where each image-text pair usually contains a bag of semantic concepts. However, existing open-vocabulary object detectors only align region embeddings individually with the corresponding features extracted from the VLMs. Such a design leaves the compositional structure of semantic concepts in a scene under-exploited, although the structure may be implicitly learned by the VLMs. In this work, we propose to align the embedding of bag of regions beyond individual regions. The proposed method groups contextually interrelated regions as a bag. The embeddings of regions in a bag are treated as embeddings of words in a sentence, and they are sent to the text encoder of a VLM to obtain the bag-of-regions embedding, which is learned to be aligned to the corresponding features extracted by a frozen VLM. Applied to the commonly used Faster R-CNN, our approach surpasses the previous best results by 4.6 box AP50 and 2.8 mask AP on novel categories of open-vocabulary COCO and LVIS benchmarks, respectively. Code and models are available at https://github.com/wusize/ovdet.

Original language	English
Title of host publication	Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
Publisher	IEEE Computer Society
Pages	15254-15264
Number of pages	11
ISBN (Electronic)	9798350301298
DOIs	https://doi.org/10.1109/CVPR52729.2023.01464
Publication status	Published - 2023
Externally published	Yes
Event	2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023 - Vancouver, Canada Duration: Jun 18 2023 → Jun 22 2023

Publication series

Name	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume	2023-June
ISSN (Print)	1063-6919

Conference

Conference	2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
Country/Territory	Canada
City	Vancouver
Period	6/18/23 → 6/22/23

Bibliographical note

Publisher Copyright:
© 2023 IEEE.

ASJC Scopus Subject Areas

Software
Computer Vision and Pattern Recognition

Keywords

detection
Recognition: Categorization
retrieval

Access to Document

10.1109/CVPR52729.2023.01464

Cite this

Wu, S., Zhang, W., Jin, S., Liu, W., & Loy, C. C. (2023). Aligning Bag of Regions for Open-Vocabulary Object Detection. In Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023 (pp. 15254-15264). (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; Vol. 2023-June). IEEE Computer Society. https://doi.org/10.1109/CVPR52729.2023.01464

@inproceedings{b1dea4411d20414388b0f7ac88f25871,

title = "Aligning Bag of Regions for Open-Vocabulary Object Detection",

abstract = "Pre-trained vision-language models (VLMs) learn to align vision and language representations on large-scale datasets, where each image-text pair usually contains a bag of semantic concepts. However, existing open-vocabulary object detectors only align region embeddings individually with the corresponding features extracted from the VLMs. Such a design leaves the compositional structure of semantic concepts in a scene under-exploited, although the structure may be implicitly learned by the VLMs. In this work, we propose to align the embedding of bag of regions beyond individual regions. The proposed method groups contextually interrelated regions as a bag. The embeddings of regions in a bag are treated as embeddings of words in a sentence, and they are sent to the text encoder of a VLM to obtain the bag-of-regions embedding, which is learned to be aligned to the corresponding features extracted by a frozen VLM. Applied to the commonly used Faster R-CNN, our approach surpasses the previous best results by 4.6 box AP50 and 2.8 mask AP on novel categories of open-vocabulary COCO and LVIS benchmarks, respectively. Code and models are available at https://github.com/wusize/ovdet.",

keywords = "detection, Recognition: Categorization, retrieval",

author = "Size Wu and Wenwei Zhang and Sheng Jin and Wentao Liu and Loy, \{Chen Change\}",

note = "Publisher Copyright: {\textcopyright} 2023 IEEE.; 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023 ; Conference date: 18-06-2023 Through 22-06-2023",

year = "2023",

doi = "10.1109/CVPR52729.2023.01464",

language = "English",

series = "Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition",

publisher = "IEEE Computer Society",

pages = "15254--15264",

booktitle = "Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023",

address = "United States",

}

Wu, S, Zhang, W, Jin, S, Liu, W & Loy, CC 2023, Aligning Bag of Regions for Open-Vocabulary Object Detection. in Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2023-June, IEEE Computer Society, pp. 15254-15264, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, Canada, 6/18/23. https://doi.org/10.1109/CVPR52729.2023.01464

Aligning Bag of Regions for Open-Vocabulary Object Detection. / Wu, Size; Zhang, Wenwei; Jin, Sheng et al.
Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023. IEEE Computer Society, 2023. p. 15254-15264 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; Vol. 2023-June).

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

TY - GEN

T1 - Aligning Bag of Regions for Open-Vocabulary Object Detection

AU - Wu, Size

AU - Zhang, Wenwei

AU - Jin, Sheng

AU - Liu, Wentao

AU - Loy, Chen Change

PY - 2023

Y1 - 2023

N2 - Pre-trained vision-language models (VLMs) learn to align vision and language representations on large-scale datasets, where each image-text pair usually contains a bag of semantic concepts. However, existing open-vocabulary object detectors only align region embeddings individually with the corresponding features extracted from the VLMs. Such a design leaves the compositional structure of semantic concepts in a scene under-exploited, although the structure may be implicitly learned by the VLMs. In this work, we propose to align the embedding of bag of regions beyond individual regions. The proposed method groups contextually interrelated regions as a bag. The embeddings of regions in a bag are treated as embeddings of words in a sentence, and they are sent to the text encoder of a VLM to obtain the bag-of-regions embedding, which is learned to be aligned to the corresponding features extracted by a frozen VLM. Applied to the commonly used Faster R-CNN, our approach surpasses the previous best results by 4.6 box AP50 and 2.8 mask AP on novel categories of open-vocabulary COCO and LVIS benchmarks, respectively. Code and models are available at https://github.com/wusize/ovdet.

AB - Pre-trained vision-language models (VLMs) learn to align vision and language representations on large-scale datasets, where each image-text pair usually contains a bag of semantic concepts. However, existing open-vocabulary object detectors only align region embeddings individually with the corresponding features extracted from the VLMs. Such a design leaves the compositional structure of semantic concepts in a scene under-exploited, although the structure may be implicitly learned by the VLMs. In this work, we propose to align the embedding of bag of regions beyond individual regions. The proposed method groups contextually interrelated regions as a bag. The embeddings of regions in a bag are treated as embeddings of words in a sentence, and they are sent to the text encoder of a VLM to obtain the bag-of-regions embedding, which is learned to be aligned to the corresponding features extracted by a frozen VLM. Applied to the commonly used Faster R-CNN, our approach surpasses the previous best results by 4.6 box AP50 and 2.8 mask AP on novel categories of open-vocabulary COCO and LVIS benchmarks, respectively. Code and models are available at https://github.com/wusize/ovdet.

KW - detection

KW - Recognition: Categorization

KW - retrieval

UR - http://www.scopus.com/inward/record.url?scp=85172436537&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85172436537&partnerID=8YFLogxK

U2 - 10.1109/CVPR52729.2023.01464

DO - 10.1109/CVPR52729.2023.01464

M3 - Conference contribution

AN - SCOPUS:85172436537

T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

SP - 15254

EP - 15264

BT - Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023

PB - IEEE Computer Society

T2 - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023

Y2 - 18 June 2023 through 22 June 2023

ER -

Wu S, Zhang W, Jin S, Liu W, Loy CC. Aligning Bag of Regions for Open-Vocabulary Object Detection. In Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023. IEEE Computer Society. 2023. p. 15254-15264. (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition). doi: 10.1109/CVPR52729.2023.01464

Aligning Bag of Regions for Open-Vocabulary Object Detection

Abstract

Publication series

Conference

Bibliographical note

ASJC Scopus Subject Areas

Keywords

Access to Document

Other files and links

Cite this