Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow

Kavli Affiliate: Wei Gao

| First 5 Authors: Ruyang Liu, Ruyang Liu, , ,

| Summary:

Long-form video understanding has always been a challenging problem due to
the significant redundancy in both temporal and spatial contents. This
challenge is further exacerbated by the limited context length of Multimodal
Large Language Models (MLLMs). To address this issue, many previous works have
attempted to extract key video information, where the "key" is typically
semantic-aware and heavily dependent on the CLIP model as prior. In this paper,
we propose Flow4Agent, a novel framework that pioneeringly incorporates motion
priors from optical flow to facilitate LLM-based long video understanding.
Flow4Agent mitigates the redundancy in long videos at both temporal and spatial
levels through two core modules: Temporal Granularity Optimization (TGO)
adaptively refines framelevel hierarchies, which first leverages coarse flow
priors to group similar visual contents and then applies semantic priors to
filter out highly irrelevant scene information. Motion Token Pruning (MTP)
further refines the intra-frame visual representations, pruning high-redundancy
video tokens using fine-grained optical flow information. Extensive experiments
demonstrate that our Flow4Agent outperforms existing methods across a wide
range of video MLLM benchmarks, especially for hour-level video understanding
tasks, achieving 64.7% on Video-MME, 71.4% on MLVU and 60.4% on LongVideoBench.

| Search Query: ArXiv Query: search_query=au:”Wei Gao”&id_list=&start=0&max_results=3