Episodic Memory Representation for Long-form Video Understanding

Kavli Affiliate: Long Zhang

| First 5 Authors: Yun Wang, , , ,

| Summary:

Video Large Language Models (Video-LLMs) excel at general video understanding
but struggle with long-form videos due to context window limits. Consequently,
recent approaches focus on keyframe retrieval, condensing lengthy videos into a
small set of informative frames. Despite their practicality, these methods
simplify the problem to static text-image matching, overlooking spatio-temporal
relationships crucial for capturing scene transitions and contextual
continuity, and may yield redundant keyframes with limited information,
diluting salient cues essential for accurate video question answering. To
address these limitations, we introduce Video-EM, a training-free framework
inspired by the principles of human episodic memory, designed to facilitate
robust and contextually grounded reasoning. Rather than treating keyframes as
isolated visual entities, Video-EM explicitly models them as temporally ordered
episodic events, capturing both spatial relationships and temporal dynamics
necessary for accurately reconstructing the underlying narrative. Furthermore,
the framework leverages chain-of-thought (CoT) reasoning with LLMs to
iteratively identify a minimal yet highly informative subset of episodic
memories, enabling efficient and accurate question answering by Video-LLMs.
Extensive evaluations on the Video-MME, EgoSchema, HourVideo, and LVBench
benchmarks confirm the superiority of Video-EM, which achieves highly
competitive results with performance gains of 4-9 percent over respective
baselines while utilizing fewer frames.
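
The abstract outlines a pipeline: retrieve keyframes, group them into temporally ordered episodic events, then have an LLM iteratively prune those events to a minimal yet informative subset before question answering. A minimal Python sketch of that flow, under stated assumptions, is shown below; every name (EpisodicEvent, group_keyframes_into_events, select_episodic_memories, llm_reason, gap_threshold) is an illustrative placeholder rather than the authors' implementation.

```python
# Conceptual sketch of the episodic-memory pipeline described in the
# abstract. All names, data structures, and thresholds are illustrative
# assumptions, not the Video-EM authors' actual implementation.

from dataclasses import dataclass, field

@dataclass
class EpisodicEvent:
    """A temporally ordered group of keyframes treated as one episode."""
    start_time: float
    end_time: float
    frame_ids: list = field(default_factory=list)

def group_keyframes_into_events(keyframes, gap_threshold=5.0):
    """Group retrieved keyframes into episodic events by temporal proximity.

    `keyframes` is a list of (frame_id, timestamp) pairs sorted by time;
    the gap threshold (in seconds) is a made-up heuristic for illustration.
    """
    events = []
    for frame_id, t in keyframes:
        if events and t - events[-1].end_time <= gap_threshold:
            events[-1].frame_ids.append(frame_id)   # extend the current episode
            events[-1].end_time = t
        else:                                        # start a new episode
            events.append(EpisodicEvent(start_time=t, end_time=t,
                                        frame_ids=[frame_id]))
    return events

def select_episodic_memories(events, question, llm_reason, max_rounds=3):
    """Iteratively keep only the events an LLM judges relevant to the question.

    `llm_reason(question, events)` stands in for a chain-of-thought call that
    returns the subset of `events` still needed to answer the question.
    """
    selected = events
    for _ in range(max_rounds):
        kept = llm_reason(question, selected)
        if len(kept) == len(selected):
            break            # no further pruning: treat as the minimal subset
        selected = kept
    return selected
```

In the workflow described by the abstract, the surviving events (frames plus their temporal ordering) would then be handed to a Video-LLM for question answering, which is the stage the reported 4-9 percent gains over baselines refer to.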

| Search Query: ArXiv Query: search_query=au:"Long Zhang"&id_list=&start=0&max_results=3
