Kavli Affiliate: Yi Zhou
| First 5 Authors: Ruiqi Wu, Na Su, Chenran Zhang, Tengfei Ma, Tao Zhou
| Summary:
Vision-language pretraining (VLP) has been investigated as a way to generalize
across diverse downstream tasks in fundus image analysis. Although recent
methods show promising results, they rely heavily on large-scale private
image-text data while paying little attention to the pretraining strategy
itself, which limits further advances. In this work, we introduce MM-Retinal V2, a
high-quality image-text paired dataset comprising CFP, FFA, and OCT image
modalities. We then propose KeepFIT V2, a novel fundus vision-language
pretraining model that injects knowledge from this elite data spark into
category-labeled public datasets. Specifically, a preliminary textual
pretraining stage equips the text encoder with ophthalmic textual knowledge.
Moreover, a hybrid image-text knowledge injection module is designed for
knowledge transfer; it combines global semantic concepts from contrastive
learning with local appearance details from generative learning (see the
sketch below). Extensive experiments across
zero-shot, few-shot, and linear probing settings highlight the generalization
and transferability of KeepFIT V2, delivering performance competitive with
state-of-the-art fundus VLP models trained on large-scale private image-text
datasets. Our dataset and model are publicly available via
https://github.com/lxirich/MM-Retinal.
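
The summary does not spell out the hybrid knowledge injection objective, so the
following is only a minimal, hypothetical PyTorch-style sketch of how a global
contrastive term and a local generative (captioning-style) term might be combined.
The names image_encoder, text_encoder, caption_decoder, and lambda_gen are
assumptions made for illustration, not the actual KeepFIT V2 implementation.

    # Hypothetical sketch: combine a global contrastive loss with a local
    # generative loss. Not taken from the KeepFIT V2 codebase.
    import torch
    import torch.nn.functional as F

    def hybrid_vlp_loss(image_encoder, text_encoder, caption_decoder,
                        images, input_ids, temperature=0.07, lambda_gen=1.0):
        # Global branch: CLIP-style contrastive alignment of image/text embeddings.
        img_emb = F.normalize(image_encoder(images), dim=-1)    # (B, D)
        txt_emb = F.normalize(text_encoder(input_ids), dim=-1)  # (B, D)
        logits = img_emb @ txt_emb.t() / temperature            # (B, B) similarities
        targets = torch.arange(images.size(0), device=images.device)
        loss_con = 0.5 * (F.cross_entropy(logits, targets)
                          + F.cross_entropy(logits.t(), targets))

        # Local branch: next-token captioning loss conditioned on the image
        # embedding, standing in for "local appearance details from generative
        # learning".
        token_logits = caption_decoder(img_emb, input_ids[:, :-1])  # (B, T-1, vocab)
        loss_gen = F.cross_entropy(
            token_logits.reshape(-1, token_logits.size(-1)),
            input_ids[:, 1:].reshape(-1),
        )

        return loss_con + lambda_gen * loss_gen

Under this kind of formulation, the shared image-text embedding space would also
support the zero-shot setting reported above, e.g. by ranking class-name prompts
against an image embedding, while few-shot and linear probing would reuse the
frozen image encoder features.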
| Search Query: ArXiv Query: search_query=au:"Yi Zhou"&id_list=&start=0&max_results=3