Kavli Affiliate: Jing Wang | First 5 Authors: Shihao Li | Summary: Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. […]
IF-VidCap: Can Video Caption Models Follow Instructions?