Through the Theory of Mind’s Eye: Reading Minds with Multimodal Video Large Language Models

Kavli Affiliate: Xiang Zhang

| First 5 Authors: Zhawnen Chen, Tianchun Wang, Yizhou Wang, Michal Kosinski, Xiang Zhang

| Summary:

Can large multimodal models have a human-like ability for emotional and
social reasoning, and if so, how does it work? Recent research has discovered
emergent theory-of-mind (ToM) reasoning capabilities in large language models
(LLMs). LLMs can reason about people’s mental states by solving
text-based ToM tasks that ask questions about actors’ beliefs, desires,
and intentions. However, human reasoning in the wild is often
grounded in dynamic scenes that unfold over time. We therefore consider video
a new medium for examining spatio-temporal ToM reasoning. Specifically, we ask
explicit probing questions about videos rich in social and emotional content.
We develop a pipeline for multimodal LLM ToM reasoning
using video and text. We also enable explicit ToM reasoning by retrieving key
frames for answering a ToM question, which reveals how multimodal LLMs reason
about ToM.
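
The summary does not specify how the key frames are retrieved. Below is a minimal, hypothetical sketch of one plausible approach: scoring sampled video frames against the ToM question with a CLIP-style image-text encoder and keeping the top-scoring frames as context for a multimodal LLM. The model choice, sampling rate, and helper names are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: rank sampled frames by image-text similarity to a ToM
# question and keep the top-k as context for a multimodal LLM. The retrieval
# method and model are assumptions; the paper summary only states that key
# frames are retrieved for each question.
import cv2
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def sample_frames(video_path: str, every_n: int = 30):
    """Decode roughly one frame per second (assuming a ~30 fps source video)."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        idx += 1
    cap.release()
    return frames

def retrieve_key_frames(video_path: str, question: str, top_k: int = 4):
    """Return the frames most similar to the ToM question, in temporal order."""
    frames = sample_frames(video_path)
    inputs = processor(text=[question], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_image has shape (num_frames, 1): one score per frame.
        scores = model(**inputs).logits_per_image.squeeze(-1)
    best = scores.topk(min(top_k, len(frames))).indices.tolist()
    return [frames[i] for i in sorted(best)]

# Example (hypothetical file and question): the selected frames would then be
# passed, together with the question, to a video/multimodal LLM for the answer.
key_frames = retrieve_key_frames("example_clip.mp4",
                                 "Why does the woman look surprised?")
```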

| Search Query: ArXiv Query: search_query=au:"Xiang Zhang"&id_list=&start=0&max_results=3
