Kavli Affiliate: Li Xin Li | First 5 Authors: Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang | Summary: Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details. Besides, the lack […]
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM