Empowering Segmentation Ability to Multi-modal Large Language Models

Kavli Affiliate: Jing Wang

| First 5 Authors: Yuqi Yang, Peng-Tao Jiang, Jing Wang, Hao Zhang, Kai Zhao

| Summary:

Multi-modal large language models (MLLMs) can understand image-language
prompts and demonstrate impressive reasoning ability. In this paper, we extend
MLLMs’ output by empowering them with segmentation ability: the extended
MLLMs can both produce language responses to image-language prompts and
segment the regions that the complex question or query in the language prompt
focuses on. To this end, the existing work LISA enlarges the original word
embeddings with an additional segment token and fine-tunes dialogue generation
and query-focused segmentation together, where the feature of the segment token
is used to prompt the Segment Anything Model. Although LISA achieves superior
segmentation performance, we observe that its dialogue ability decreases by a
large margin compared to the original MLLMs. To maintain the original MLLMs’
dialogue ability, we propose a novel MLLM framework, named LLaVASeg, which
leverages a chain-of-thought prompting strategy to instruct the MLLM to
segment the target region queried by the user. The MLLM is first prompted to
reason out a simple description of the target region from the complicated
user query, and then to extract the visual attributes of the target region
according to its understanding of the image. These visual attributes, such as
color and relative location, are used to prompt the downstream segmentation
model. Experiments show that the proposed method preserves the original
dialogue ability and equips the MLLM with strong reasoning segmentation
ability. The code is available at
https://github.com/YuqiYang213/LLaVASeg.
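
To make the described prompting flow concrete, below is a minimal Python sketch of the three stages (user query → target description → visual attributes → segmentation prompt). The function names `query_mllm` and `run_segmenter`, as well as the prompt wording, are illustrative assumptions rather than the actual LLaVASeg implementation; see the repository linked above for the real code.

```python
# Hypothetical sketch of the chain-of-thought prompting pipeline described in
# the summary. `query_mllm` and `run_segmenter` are placeholders for a frozen
# MLLM (e.g. LLaVA) and a text-promptable segmentation model, respectively.

def query_mllm(image, prompt: str) -> str:
    """Placeholder: send an image plus text prompt to a frozen MLLM, return its reply."""
    raise NotImplementedError("plug in an off-the-shelf MLLM here")


def run_segmenter(image, text_prompt: str):
    """Placeholder: run a text-promptable segmentation model, return a mask."""
    raise NotImplementedError("plug in a text-promptable segmentation model here")


def reasoning_segmentation(image, user_query: str):
    # Stage 1: distill the complicated user query into a simple description of
    # the target region, relying on the MLLM's reasoning ability.
    target = query_mllm(
        image,
        f"Question: {user_query}\n"
        "Name the single object or region in the image that answers this "
        "question, in a short phrase.",
    )

    # Stage 2: ask the MLLM for visual attributes of that target (color,
    # relative location, ...), grounded in its understanding of the image.
    attributes = query_mllm(
        image,
        f"Describe the visual attributes of '{target}' in this image, such as "
        "its color and relative location, in one short sentence.",
    )

    # Stage 3: prompt the downstream segmentation model with the textual
    # description and attributes. The MLLM itself stays frozen, so its
    # original dialogue ability is untouched.
    mask = run_segmenter(image, f"{target}, {attributes}")
    return target, attributes, mask
```

In this sketch the MLLM is only queried through prompts and never fine-tuned with a segment token, which is the design choice the summary credits with preserving the original dialogue ability.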

| Search Query: ArXiv Query: search_query=au:"Jing Wang"&id_list=&start=0&max_results=3