Kavli Affiliate: Yi Zhou
| First 5 Authors: Zirui Li, Siwei Wu, Xingyu Wang, Yi Zhou, Yizhi Li
| Summary:
The rapid advancement of unsupervised representation learning and large-scale
pre-trained vision-language models has significantly improved cross-modal
retrieval tasks. However, existing multi-modal information retrieval (MMIR)
studies lack a comprehensive exploration of document-level retrieval and suffer
from the absence of cross-domain datasets at this granularity. To address this
limitation, we introduce DocMMIR, a novel multi-modal document retrieval
framework designed explicitly to unify diverse document formats and domains,
including Wikipedia articles, scientific papers (arXiv), and presentation
slides, within a comprehensive retrieval scenario. We construct a large-scale
cross-domain multi-modal benchmark, comprising 450K samples, which
systematically integrates textual and visual information. Our comprehensive
experimental analysis reveals substantial limitations in current
state-of-the-art MLLMs (CLIP, BLIP2, SigLIP-2, ALIGN) when applied to our
tasks, with only CLIP demonstrating reasonable zero-shot performance.
Furthermore, we conduct a systematic investigation of training strategies,
including cross-modal fusion methods and loss functions, and develop a tailored
approach to train CLIP on our benchmark. This results in a +31% improvement in
MRR@10 compared to the zero-shot baseline. All our data and code are released
at https://github.com/J1mL1/DocMMIR.
| Search Query: ArXiv Query: search_query=au:"Yi Zhou"&id_list=&start=0&max_results=3
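
For context on the reported metric, below is a minimal sketch of how MRR@10 (mean reciprocal rank, cut off at rank 10) is typically computed for a retrieval benchmark of this kind. The function name, document-ID format, and toy queries are illustrative assumptions and are not taken from the DocMMIR release.

```python
# Minimal sketch of MRR@10; names and toy data are illustrative, not from DocMMIR.

def reciprocal_rank_at_10(ranked_doc_ids, relevant_doc_id):
    """Return 1/rank of the relevant document if it appears in the top 10, else 0."""
    for rank, doc_id in enumerate(ranked_doc_ids[:10], start=1):
        if doc_id == relevant_doc_id:
            return 1.0 / rank
    return 0.0

# MRR@10 is the mean of the per-query reciprocal ranks.
queries = [
    (["d3", "d1", "d7"], "d1"),  # relevant document at rank 2 -> 0.5
    (["d9", "d2", "d4"], "d9"),  # relevant document at rank 1 -> 1.0
    (["d5", "d6", "d8"], "d0"),  # relevant document not retrieved -> 0.0
]
mrr_at_10 = sum(reciprocal_rank_at_10(r, rel) for r, rel in queries) / len(queries)
print(f"MRR@10 = {mrr_at_10:.2f}")  # 0.50
```

Under this definition, the reported +31% gain in MRR@10 reflects how much higher the target document ranks, on average, after the tailored fine-tuning compared with the zero-shot CLIP baseline.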