MIM: Multi-modal Content Interest Modeling Paradigm for User Behavior Modeling

Kavli Affiliate: Xiang Zhang

| First 5 Authors: Bencheng Yan, Si Chen, Shichang Jia, Jianyu Liu, Yueran Liu

| Summary:

Click-Through Rate (CTR) prediction is a crucial task in recommendation
systems, online searches, and advertising platforms, where accurately capturing
users’ real interests in content is essential for performance. However,
existing methods heavily rely on ID embeddings, which fail to reflect users’
true preferences for content such as images and titles. This limitation becomes
particularly evident in cold-start and long-tail scenarios, where traditional
approaches struggle to deliver effective results. To address these challenges,
we propose a novel Multi-modal Content Interest Modeling paradigm (MIM), which
consists of three key stages: Pre-training, Content-Interest-Aware Supervised
Fine-Tuning (C-SFT), and Content-Interest-Aware UBM (CiUBM). The pre-training
stage adapts foundational models to domain-specific data, enabling the
extraction of high-quality multi-modal embeddings. The C-SFT stage bridges the
semantic gap between content and user interests by leveraging user behavior
signals to guide the alignment of embeddings with user preferences. Finally,
the CiUBM stage integrates multi-modal embeddings and ID-based collaborative
filtering signals into a unified framework. Comprehensive offline experiments
and online A/B tests conducted on Taobao, one of the world’s largest
e-commerce platforms, demonstrated the effectiveness and efficiency of the MIM
method. The method has been successfully deployed online, achieving a
significant increase of +14.14% in CTR and +4.12% in RPM, showcasing its
industrial applicability and substantial impact on platform performance. To
promote further research, we have publicly released the code and dataset at
https://pan.quark.cn/s/8fc8ec3e74f3.
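
To make the idea of combining multi-modal content embeddings with ID-based signals more concrete, here is a minimal sketch in plain Python. It is not the authors' CiUBM implementation: the function names, the cosine-plus-softmax weighting, and the simple concatenation at the end are all assumptions chosen for illustration. It pools a user's behavior history by how similar each past item's content embedding is to the target item, then joins the pooled content and ID vectors into one feature vector that a downstream CTR model could consume.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    """Weighted sum of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]


def content_interest_features(target_content, target_id, history_content, history_id):
    """Hypothetical fusion step (an assumption, not the paper's architecture).

    Weights each past behavior by the content similarity between that item
    and the target item, pools both the content and the ID embeddings of the
    history with those weights, and concatenates everything.
    """
    weights = softmax([cosine(target_content, h) for h in history_content])
    content_interest = weighted_sum(weights, history_content)
    id_interest = weighted_sum(weights, history_id)
    features = target_content + target_id + content_interest + id_interest
    return features, weights


# Toy example: two past items with 2-d content and 1-d ID embeddings.
history_content = [[1.0, 0.0], [0.0, 1.0]]
history_id = [[0.5], [-0.5]]
features, weights = content_interest_features(
    target_content=[1.0, 0.0],
    target_id=[0.2],
    history_content=history_content,
    history_id=history_id,
)
```

In this toy run the first history item has the same content embedding as the target, so it receives the larger attention weight. A production system would presumably learn the similarity function and the fusion layer rather than fixing them by hand as done here.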

| Search Query: ArXiv Query: search_query=au:"Xiang Zhang"&id_list=&start=0&max_results=3
