Explore In-Context Segmentation via Latent Diffusion Models

Chaoyang Wang; Xiangtai Li; Henghui Ding; Lu Qi; Jiangning Zhang; Yunhai Tong; Chen Change Loy; Shuicheng Yan

doi:10.1609/aaai.v39i7.32812

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In-context segmentation has drawn increasing attention with the advent of vision foundation models. Its goal is to segment objects using given reference images. Most existing approaches adopt metric learning or masked image modeling to build the correlation between visual prompts and input image queries. This work approaches the problem from a fresh perspective – unlocking the capability of the latent diffusion model (LDM) for in-context segmentation and investigating different design choices. Specifically, we examine the problem from three angles: instruction extraction, output alignment, and meta-architectures. We design a two-stage masking strategy to prevent interfering information from leaking into the instructions. In addition, we propose an augmented pseudo-masking target to ensure the model predicts without forgetting the original images. Moreover, we build a new and fair in-context segmentation benchmark that covers both image and video datasets. Experiments validate the effectiveness of our approach, demonstrating comparable or even stronger results than previous specialist or visual foundation models. We hope our work inspires others to rethink the unification of segmentation and generation.

Original language	English
Title of host publication	Special Track on AI Alignment
Editors	Toby Walsh, Julie Shah, Zico Kolter
Publisher	Association for the Advancement of Artificial Intelligence
Pages	7545-7553
Number of pages	9
Edition	7
ISBN (Electronic)	157735897X, 9781577358978
DOIs	https://doi.org/10.1609/aaai.v39i7.32812
Publication status	Published - Apr 11 2025
Externally published	Yes
Event	39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025 - Philadelphia, United States Duration: Feb 25 2025 → Mar 4 2025

Publication series

Name	Proceedings of the AAAI Conference on Artificial Intelligence
Number	7
Volume	39
ISSN (Print)	2159-5399
ISSN (Electronic)	2374-3468

Conference

Conference	39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025
Country/Territory	United States
City	Philadelphia
Period	2/25/25 → 3/4/25

Bibliographical note

Publisher Copyright:
Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

ASJC Scopus Subject Areas

Artificial Intelligence

Access to Document

10.1609/aaai.v39i7.32812

Cite this

Wang, C., Li, X., Ding, H., Qi, L., Zhang, J., Tong, Y., Loy, C. C., & Yan, S. (2025). Explore In-Context Segmentation via Latent Diffusion Models. In T. Walsh, J. Shah, & Z. Kolter (Eds.), Special Track on AI Alignment (7 ed., pp. 7545-7553). (Proceedings of the AAAI Conference on Artificial Intelligence; Vol. 39, No. 7). Association for the Advancement of Artificial Intelligence. https://doi.org/10.1609/aaai.v39i7.32812

@inproceedings{3007293b0c2e470d914b149c9fdd5de4,

title = "Explore In-Context Segmentation via Latent Diffusion Models",

abstract = "In-context segmentation has drawn increasing attention with the advent of vision foundation models. Its goal is to segment objects using given reference images. Most existing approaches adopt metric learning or masked image modeling to build the correlation between visual prompts and input image queries. This work approaches the problem from a fresh perspective – unlocking the capability of the latent diffusion model (LDM) for in-context segmentation and investigating different design choices. Specifically, we examine the problem from three angles: instruction extraction, output alignment, and meta-architectures. We design a two-stage masking strategy to prevent interfering information from leaking into the instructions. In addition, we propose an augmented pseudo-masking target to ensure the model predicts without forgetting the original images. Moreover, we build a new and fair in-context segmentation benchmark that covers both image and video datasets. Experiments validate the effectiveness of our approach, demonstrating comparable or even stronger results than previous specialist or visual foundation models. We hope our work inspires others to rethink the unification of segmentation and generation.",

author = "Chaoyang Wang and Xiangtai Li and Henghui Ding and Lu Qi and Jiangning Zhang and Yunhai Tong and Loy, \{Chen Change\} and Shuicheng Yan",

note = "Publisher Copyright: Copyright {\textcopyright} 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.; 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025 ; Conference date: 25-02-2025 Through 04-03-2025",

year = "2025",

month = apr,

day = "11",

doi = "10.1609/aaai.v39i7.32812",

language = "English",

series = "Proceedings of the AAAI Conference on Artificial Intelligence",

publisher = "Association for the Advancement of Artificial Intelligence",

number = "7",

pages = "7545--7553",

editor = "Toby Walsh and Julie Shah and Zico Kolter",

booktitle = "Special Track on AI Alignment",

edition = "7",

}

Wang, C, Li, X, Ding, H, Qi, L, Zhang, J, Tong, Y, Loy, CC & Yan, S 2025, Explore In-Context Segmentation via Latent Diffusion Models. in T Walsh, J Shah & Z Kolter (eds), Special Track on AI Alignment. 7 edn, Proceedings of the AAAI Conference on Artificial Intelligence, no. 7, vol. 39, Association for the Advancement of Artificial Intelligence, pp. 7545-7553, 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025, Philadelphia, United States, 2/25/25. https://doi.org/10.1609/aaai.v39i7.32812

Explore In-Context Segmentation via Latent Diffusion Models. / Wang, Chaoyang; Li, Xiangtai; Ding, Henghui et al.
Special Track on AI Alignment. ed. / Toby Walsh; Julie Shah; Zico Kolter. 7. ed. Association for the Advancement of Artificial Intelligence, 2025. p. 7545-7553 (Proceedings of the AAAI Conference on Artificial Intelligence; Vol. 39, No. 7).

TY - GEN

T1 - Explore In-Context Segmentation via Latent Diffusion Models

AU - Wang, Chaoyang

AU - Li, Xiangtai

AU - Ding, Henghui

AU - Qi, Lu

AU - Zhang, Jiangning

AU - Tong, Yunhai

AU - Loy, Chen Change

AU - Yan, Shuicheng

PY - 2025/4/11

Y1 - 2025/4/11

AB - In-context segmentation has drawn increasing attention with the advent of vision foundation models. Its goal is to segment objects using given reference images. Most existing approaches adopt metric learning or masked image modeling to build the correlation between visual prompts and input image queries. This work approaches the problem from a fresh perspective – unlocking the capability of the latent diffusion model (LDM) for in-context segmentation and investigating different design choices. Specifically, we examine the problem from three angles: instruction extraction, output alignment, and meta-architectures. We design a two-stage masking strategy to prevent interfering information from leaking into the instructions. In addition, we propose an augmented pseudo-masking target to ensure the model predicts without forgetting the original images. Moreover, we build a new and fair in-context segmentation benchmark that covers both image and video datasets. Experiments validate the effectiveness of our approach, demonstrating comparable or even stronger results than previous specialist or visual foundation models. We hope our work inspires others to rethink the unification of segmentation and generation.

UR - http://www.scopus.com/inward/record.url?scp=105004002663&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=105004002663&partnerID=8YFLogxK

U2 - 10.1609/aaai.v39i7.32812

DO - 10.1609/aaai.v39i7.32812

M3 - Conference contribution

AN - SCOPUS:105004002663

T3 - Proceedings of the AAAI Conference on Artificial Intelligence

SP - 7545

EP - 7553

BT - Special Track on AI Alignment

A2 - Walsh, Toby

A2 - Shah, Julie

A2 - Kolter, Zico

PB - Association for the Advancement of Artificial Intelligence

T2 - 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025

Y2 - 25 February 2025 through 4 March 2025

ER -
