Kavli Affiliate: Long Zhang
| First 5 Authors: Peipei Song, Long Zhang, Long Lan, Weidong Chen, Dan Guo
| Summary:
Partially relevant video retrieval (PRVR) is a practical yet challenging task
in text-to-video retrieval, where videos are untrimmed and contain substantial
background content. The goal is an effective and efficient solution that
captures the partial correspondence between text queries and untrimmed videos.
However, existing PRVR methods, which typically focus on modeling multi-scale
clip representations, suffer from content independence and information
redundancy, impairing retrieval performance. To overcome these
limitations, we propose a simple yet effective approach with active moment
discovering (AMDNet). The key idea is to discover video moments that are
semantically consistent with their queries. By using learnable span anchors to
capture distinct moments and applying masked multi-moment attention to
emphasize salient moments while suppressing redundant background, AMDNet
achieves more compact and informative video representations.
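
The abstract does not include code; the following is a minimal PyTorch sketch
of how learnable span anchors and masked multi-moment attention could be wired
up. The module name, the (center, width) anchor parameterization, and the
Gaussian-style additive mask are illustrative assumptions, not the authors'
implementation.

    # Hedged sketch (not the authors' code): masked multi-moment attention
    # driven by learnable span anchors, assuming T pre-extracted frame
    # features of dimension D per video.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ActiveMomentAttention(nn.Module):  # hypothetical module name
        def __init__(self, dim=512, num_moments=4):
            super().__init__()
            # Each anchor is a (center, width) pair in (0, 1), learned end to end.
            self.anchors = nn.Parameter(torch.rand(num_moments, 2))
            self.query = nn.Parameter(torch.randn(num_moments, dim))
            self.scale = dim ** -0.5

        def forward(self, frames):                    # frames: (B, T, D)
            B, T, D = frames.shape
            t = torch.linspace(0, 1, T, device=frames.device)       # frame positions
            center, width = torch.sigmoid(self.anchors).unbind(-1)  # (M,), (M,)
            # Additive log-mask: near zero inside each span, strongly
            # negative outside, so attention concentrates on the moment.
            mask = -((t[None, :] - center[:, None]) ** 2) / (width[:, None] ** 2 + 1e-6)
            attn = torch.einsum('md,btd->bmt', self.query, frames) * self.scale
            attn = F.softmax(attn + mask[None], dim=-1)  # masked multi-moment attention
            return torch.einsum('bmt,btd->bmd', attn, frames)  # (B, M, D) moment features

    # Usage: moments = ActiveMomentAttention()(torch.randn(2, 64, 512))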
To further enhance moment modeling, we introduce a moment diversity loss that
encourages different moments to cover distinct regions, and a moment relevance
loss that promotes semantically query-relevant moments; both cooperate with a
partially relevant retrieval loss for end-to-end optimization.
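
As a companion sketch, here is one plausible realization of the two auxiliary
losses. The exact formulations are assumptions based only on the abstract: a
pairwise-overlap penalty for diversity and a best-moment similarity term for
relevance.

    # Hedged sketch of the auxiliary losses; the paper's formulations may
    # differ. `masks`: (M, T) nonnegative soft span masks (e.g., the
    # exponential of the additive log-mask above); `moments`: (B, M, D)
    # moment features; `queries`: (B, D) text embeddings of paired queries.
    import torch
    import torch.nn.functional as F

    def moment_diversity_loss(masks):
        # Push different moments toward distinct regions by penalizing
        # overlap (cosine similarity) between every pair of span masks.
        m = F.normalize(masks, dim=-1)                    # (M, T)
        sim = m @ m.t()                                   # (M, M) pairwise overlap
        off_diag = sim - torch.diag_embed(sim.diagonal()) # zero out self-similarity
        return off_diag.clamp(min=0).mean()

    def moment_relevance_loss(moments, queries):
        # Promote query-relevant moments: the best-matching moment of the
        # paired video should score highly for its own text query.
        sim = torch.einsum('bd,bmd->bm',
                           F.normalize(queries, dim=-1),
                           F.normalize(moments, dim=-1))  # (B, M)
        return (1 - sim.max(dim=-1).values).mean()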
Extensive experiments on two large-scale video datasets (i.e., TVR and
ActivityNet Captions) demonstrate the superiority and efficiency of our
AMDNet. In particular, AMDNet is about 15.5 times smaller (in parameters) and
6.0 points higher (in SumR) than the recent method GMMFormer on TVR.
| Search Query: ArXiv Query: search_query=au:"Long Zhang"&id_list=&start=0&max_results=3