Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization

Kavli Affiliate: Long Zhang

| First 5 Authors: Yuanli Wu

| Summary:

We propose a rubric-guided, pseudo-labeled, and prompt-driven zero-shot video
summarization framework that bridges large language models with structured
semantic reasoning. A small subset of human annotations is converted into
high-confidence pseudo labels and organized into dataset-adaptive rubrics
defining clear evaluation dimensions such as thematic relevance, action detail,
and narrative progression. During inference, boundary scenes, including the
opening and closing segments, are scored independently based on their own
descriptions, while intermediate scenes incorporate concise summaries of
adjacent segments to assess narrative continuity and redundancy. This design
enables the language model to balance local salience with global coherence
without any parameter tuning. Across three benchmarks, the proposed method
achieves stable and competitive results, with F1 scores of 57.58 on SumMe,
63.05 on TVSum, and 53.79 on QFVS, surpassing zero-shot baselines by 0.85,
0.84, and 0.37 points, respectively. These outcomes demonstrate that rubric-guided
pseudo labeling combined with contextual prompting effectively stabilizes
LLM-based scoring and establishes a general, interpretable, and training-free
paradigm for both generic and query-focused video summarization.
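
The context rule described above (boundary scenes scored on their own
descriptions, intermediate scenes scored with summaries of their neighbors) can
be illustrated with a minimal sketch. This is not the authors' implementation:
the `Scene` type, the `build_prompt` function, the prompt wording, and the
1-to-10 scale are all hypothetical, and the rubric dimensions are simply the
three examples named in the summary.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Scene:
    description: str  # full text description of this scene
    summary: str      # concise summary, used only as context for neighbors


# Assumption: rubric dimensions copied from the examples in the summary.
RUBRIC = ["thematic relevance", "action detail", "narrative progression"]


def build_prompt(scenes: List[Scene], i: int) -> str:
    """Build a scoring prompt for scene i (hypothetical prompt wording)."""
    # Assumption: a 1-to-10 scale; the summary does not state the scale.
    lines = ["Score the scene from 1 to 10 on: " + ", ".join(RUBRIC) + "."]

    # Opening and closing scenes are scored on their own description only.
    is_boundary = i == 0 or i == len(scenes) - 1
    if not is_boundary:
        # Intermediate scenes also see concise summaries of adjacent scenes.
        lines.append("Previous scene summary: " + scenes[i - 1].summary)
        lines.append("Next scene summary: " + scenes[i + 1].summary)
        lines.append("Consider narrative continuity and redundancy.")

    lines.append("Scene description: " + scenes[i].description)
    return "\n".join(lines)
```

Each prompt returned by `build_prompt` would then be sent to a language model
of choice; no model call is shown here because the summary does not name one.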

| Search Query: ArXiv Query: search_query=au:"Long Zhang"&id_list=&start=0&max_results=3