Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method

Kavli Affiliate: Jing Wang

| First 5 Authors: Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng

| Summary:

Multimodal Large Language Models (MLLMs) have shown significant progress in
offline video understanding. However, applying these models to real-world
scenarios, such as autonomous driving and human-computer interaction, presents
unique challenges due to the need for real-time processing of continuous online
video streams. To this end, this paper presents systematic efforts from three
perspectives: evaluation benchmark, model architecture, and training strategy.
First, we introduce OVBench, a comprehensive question-answering benchmark
specifically designed to evaluate models’ ability to perceive, memorize, and
reason within online video contexts. It features six core task types across
three temporal contexts (past, present, and future), forming 16 subtasks from
diverse datasets. Second, we propose a new Pyramid Memory Bank (PMB) that
effectively retains key spatiotemporal information in video streams. Third, we
propose an offline-to-online learning paradigm, designing an interleaved
dialogue format for online video data and constructing an instruction-tuning
dataset tailored for online video training. This framework led to the
development of VideoChat-Online, a robust and efficient model for online video
understanding. While operating at lower computational cost and higher efficiency,
VideoChat-Online outperforms existing state-of-the-art offline and online
models across popular offline video benchmarks and OVBench, demonstrating the
effectiveness of our model architecture and training strategy.
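The abstract names the Pyramid Memory Bank only at a high level. Below is a minimal, hedged sketch of what a pyramid-style streaming memory could look like: fixed-capacity levels where frame features evicted from a finer level are spatially pooled and pushed into a coarser one, so recent frames keep detail while older context is kept in compressed form. The class, method names, and eviction rule are assumptions for illustration, not the authors' actual PMB implementation.

```python
# Hedged sketch of a pyramid-style memory bank for streaming video features.
# All names and design choices here are hypothetical illustrations, not the
# paper's API: fixed-capacity levels, with evicted entries pooled spatially
# and demoted to the next (coarser) level.

from collections import deque

import torch
import torch.nn.functional as F


class PyramidMemoryBank:
    def __init__(self, capacities=(16, 32, 64), pool_factor=2):
        # One FIFO queue per level; level 0 holds the most recent,
        # highest-resolution frame features.
        self.levels = [deque() for _ in capacities]
        self.capacities = capacities
        self.pool_factor = pool_factor

    def write(self, frame_feat: torch.Tensor) -> None:
        """frame_feat: (C, H, W) feature map for the newest frame."""
        self.levels[0].append(frame_feat)
        self._rebalance()

    def _rebalance(self) -> None:
        # When a level overflows, pop its oldest entry, pool it spatially,
        # and push it into the next (coarser) level. The last level simply
        # drops its oldest entry.
        for i, cap in enumerate(self.capacities):
            while len(self.levels[i]) > cap:
                oldest = self.levels[i].popleft()
                if i + 1 < len(self.levels):
                    pooled = F.avg_pool2d(
                        oldest.unsqueeze(0), kernel_size=self.pool_factor
                    ).squeeze(0)
                    self.levels[i + 1].append(pooled)

    def read(self) -> list:
        # Return memory ordered from oldest/coarsest to newest/finest,
        # ready to be flattened into tokens for the language model.
        out = []
        for level in reversed(self.levels):
            out.extend(level)
        return out


# Example: stream 200 frames of (256, 14, 14) features through the bank.
bank = PyramidMemoryBank()
for _ in range(200):
    bank.write(torch.randn(256, 14, 14))
print(len(bank.read()))  # bounded by sum(capacities) = 112
```

The key property this sketch tries to convey is that memory stays bounded regardless of stream length, which is what makes real-time processing of continuous video feasible.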
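The "interleaved dialogue format" for online video data is likewise only named in the abstract. A plausible reading, sketched below with hypothetical field names and made-up dialogue content, is that video segments and QA turns are interleaved along the stream timeline, so each answer is conditioned only on frames observed up to that timestamp.

```python
# Hedged illustration of an interleaved dialogue training sample for online
# video instruction tuning. The actual schema is not specified in the abstract;
# this only conveys the idea of alternating video segments and timestamped QA
# turns. Field names and dialogue text are hypothetical.

sample = [
    {"type": "video", "start_sec": 0.0, "end_sec": 8.0},
    {"type": "user", "time_sec": 8.0,
     "text": "What is the person in the red jacket doing right now?"},
    {"type": "assistant", "time_sec": 8.0,
     "text": "They are loading boxes onto a cart."},
    {"type": "video", "start_sec": 8.0, "end_sec": 20.0},
    {"type": "user", "time_sec": 20.0,
     "text": "Has the cart been moved since the last question?"},
    {"type": "assistant", "time_sec": 20.0,
     "text": "Yes, it was pushed toward the loading dock around the 15-second mark."},
]
```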

| Search Query: ArXiv Query: search_query=au:"Jing Wang"&id_list=&start=0&max_results=3
