VILLS: Video-Image Learning to Learn Semantics for Person Re-Identification

Kavli Affiliate: Cheng Peng

| First 5 Authors: Siyuan Huang, Ram Prabhakar, Yuxiang Guo, Rama Chellappa, Cheng Peng

| Summary:

Person re-identification is a research area with significant real-world
applications. Despite recent progress, existing methods face challenges in
robust re-identification in the wild, e.g., by focusing only on a particular
modality and on unreliable patterns such as clothing. A generalized method is
highly desirable but remains elusive due to issues such as the
trade-off between spatial and temporal resolution and imperfect feature
extraction. We propose VILLS (Video-Image Learning to Learn Semantics), a
self-supervised method that jointly learns spatial and temporal features from
images and videos. VILLS first introduces a local semantic extraction module
that adaptively extracts semantically consistent and robust spatial features.
It then introduces a unified feature learning and adaptation module to represent
image and video modalities in a consistent feature space. By leveraging
self-supervised, large-scale pre-training, VILLS establishes a new
state of the art that significantly outperforms existing image-based and
video-based methods.

| Search Query: ArXiv Query: search_query=au:"Cheng Peng"&id_list=&start=0&max_results=3
