FashionLOGO: Prompting Multimodal Large Language Models for Fashion Logo Embeddings

Kavli Affiliate: Jing Wang

| First 5 Authors: Yulin Su, Min Yang, Minghui Qiu, Jing Wang, Tao Wang

| Summary:

Logo embedding plays a crucial role in various e-commerce applications, such as
intellectual property protection and product search, by facilitating image
retrieval and recognition. However, current methods treat logo embedding as
a purely visual problem, which may limit their performance in real-world
scenarios. A notable issue is that the textual knowledge embedded in logo
images has not been adequately explored. Therefore, we propose a novel approach
that leverages textual knowledge as an auxiliary signal to improve the robustness of
logo embedding. The emerging Multimodal Large Language Models (MLLMs) have
demonstrated remarkable capabilities in both visual and textual understanding
and could become valuable visual assistants in understanding logo images.
Inspired by this observation, our proposed method, FashionLOGO, aims to utilize
MLLMs to enhance fashion logo embedding. We explore how MLLMs can improve logo
embedding by prompting them to generate explicit textual knowledge through
three types of prompts, namely image OCR, brief caption, and detailed
description prompts, in a zero-shot setting. We adopt a cross-attention
transformer to enable image embedding queries to learn supplementary knowledge
from textual embeddings automatically. To reduce computational costs, we only
use the image embedding model in the inference stage, similar to traditional
inference pipelines. Our extensive experiments on three real-world datasets
demonstrate that FashionLOGO learns generalized and robust logo embeddings,
achieving state-of-the-art performance on all benchmark datasets. Furthermore,
we conduct comprehensive ablation studies to demonstrate the performance
improvements resulting from the introduction of MLLMs.
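To make the fusion step concrete, below is a minimal sketch of the cross-attention described in the summary, written in PyTorch. The abstract only states that image-embedding queries attend to the MLLM-generated textual embeddings (OCR, brief caption, and detailed description) and that only the image encoder runs at inference; the class name LogoCrossAttentionFusion, the 768-dimensional embeddings, and the use of torch.nn.MultiheadAttention are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the cross-attention fusion described in the abstract.
# Names, dimensions, and layer choices are illustrative assumptions, not the
# authors' actual implementation.
import torch
import torch.nn as nn

class LogoCrossAttentionFusion(nn.Module):
    """Image-embedding queries attend to MLLM-generated text embeddings."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # image_emb: (batch, 1, dim) query vector from the visual backbone
        # text_emb:  (batch, n_tokens, dim) embeddings of the OCR, brief-caption,
        #            and detailed-description outputs produced by the MLLM
        fused, _ = self.cross_attn(query=image_emb, key=text_emb, value=text_emb)
        # Residual connection keeps the visual embedding as the primary
        # representation; the text branch only supplies auxiliary knowledge.
        return self.norm(image_emb + fused)

# Illustrative training-time usage. At inference, only the image embedding
# model would be used, matching the cheaper traditional pipeline the
# abstract describes.
if __name__ == "__main__":
    fusion = LogoCrossAttentionFusion()
    image_emb = torch.randn(4, 1, 768)   # one query vector per logo image
    text_emb = torch.randn(4, 3, 768)    # three textual embeddings per image
    out = fusion(image_emb, text_emb)
    print(out.shape)  # torch.Size([4, 1, 768])
```

One plausible reading of this design is that the residual connection is what allows the text branch to be dropped entirely at inference: the fused representation stays anchored to the visual embedding, so the deployed model can keep the standard image-only pipeline.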

| Search Query: ArXiv Query: search_query=au:"Jing Wang"&id_list=&start=0&max_results=3
