Abstract
Recent Multimodal Large Language Models (MLLMs) perform remarkably well on vision-language tasks such as image captioning and question answering, but they lack an essential perception ability, namely object detection. In this work, we address this limitation by introducing a novel research problem, contextual object detection: understanding visible objects within different human-AI interactive contexts. Three representative scenarios are investigated: the language cloze test, visual captioning, and question answering. Moreover, we present ContextDET, a unified multimodal model capable of end-to-end differentiable modeling of visual-language contexts, so as to locate, identify, and associate visual objects with language inputs for human-AI interaction. ContextDET consists of three key submodels: (i) a visual encoder for extracting visual representations, (ii) a pre-trained LLM for multimodal context decoding, and (iii) a visual decoder for predicting bounding boxes given contextual object words. This new generate-then-detect framework enables us to detect object words within the human vocabulary. Extensive experiments show the advantages of ContextDET on our proposed CODE benchmark, open-vocabulary detection, and referring image segmentation.
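To make the generate-then-detect idea concrete, below is a minimal sketch of a three-part pipeline (visual encoder, multimodal context decoder, box decoder) in the spirit of the abstract. It is not the authors' ContextDET implementation; all module choices, names, and dimensions are illustrative assumptions.

```python
# Minimal generate-then-detect sketch. All placeholder modules, names, and
# dimensions are assumptions for illustration, not the ContextDET code.
import torch
import torch.nn as nn

class ContextualDetectorSketch(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256, num_queries=100):
        super().__init__()
        # (i) Visual encoder: stand-in for a pre-trained image backbone.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=16, stride=16),  # patchify
            nn.Flatten(2),  # -> (B, d_model, N_patches)
        )
        # (ii) Multimodal context decoder: stand-in for an LLM that scores
        # object words conditioned on visual and text tokens.
        self.context_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.word_head = nn.Linear(d_model, vocab_size)  # object-word logits
        # (iii) Visual decoder: predicts boxes given contextual word features.
        self.queries = nn.Embedding(num_queries, d_model)
        self.box_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.box_head = nn.Linear(d_model, 4)  # (cx, cy, w, h), normalized

    def forward(self, images, text_embeds):
        # Visual tokens: (B, N_patches, d_model).
        vis_tokens = self.visual_encoder(images).transpose(1, 2)
        # Generate step: decode the language context against visual tokens.
        ctx = self.context_decoder(tgt=text_embeds, memory=vis_tokens)
        word_logits = self.word_head(ctx)
        # Detect step: condition box queries on visual and contextual features.
        q = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        memory = torch.cat([vis_tokens, ctx], dim=1)
        boxes = self.box_head(self.box_decoder(tgt=q, memory=memory)).sigmoid()
        return word_logits, boxes

if __name__ == "__main__":
    model = ContextualDetectorSketch()
    images = torch.randn(2, 3, 224, 224)    # dummy image batch
    text_embeds = torch.randn(2, 12, 256)   # dummy text-token embeddings
    word_logits, boxes = model(images, text_embeds)
    print(word_logits.shape, boxes.shape)   # (2, 12, 32000) (2, 100, 4)
```

The key design point the sketch tries to mirror is the ordering: object words are first produced from the multimodal context, and boxes are then grounded on those words, rather than classifying a fixed label set after detection.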
Original language | English
---|---
Pages (from-to) | 825-843
Number of pages | 19
Journal | International Journal of Computer Vision
Volume | 133
Issue number | 2
DOIs |
Publication status | Published - Feb 2025
Externally published | Yes
Bibliographical note
Publisher Copyright: © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
ASJC Scopus Subject Areas
- Software
- Computer Vision and Pattern Recognition
- Artificial Intelligence