Monkey See, Model Knew: Large Language Models Accurately Predict Visual Brain Responses in Humans AND Non-Human Primates

Kavli Affiliate: George A. Alvarez

| Authors: Colin Conwell, Emalie McMahon, Akshay Jagadeesh, Kasper Vinken, Saloni Sharma, Jacob S. Prince, George Alvarez, Talia Konkle, Margaret Livingstone and Leyla Isik

| Summary:

Recent progress in multimodal AI and ‘language-aligned’ visual representation learning has rekindled debates about the role of language in shaping the human visual system. In particular, the emergent ability of ‘language-aligned’ vision models (e.g. CLIP) – and even pure language models (e.g. BERT) – to predict image-evoked brain activity has led some to suggest that human visual cortex itself may be ‘language-aligned’ in comparable ways. But what would we make of this claim if the same procedures could model visual activity in a species without language? Here, we conducted controlled comparisons of pure-vision, pure-language, and multimodal vision-language models in their prediction of human (N=4) and rhesus macaque (N=6, 5:IT, 1:V1) ventral visual responses to the same set of 1000 captioned natural images (the ‘NSD1000’). The results revealed markedly similar patterns in model predictivity of early and late ventral visual cortex across both species. This suggests that language model predictivity of the human visual system is not necessarily due to the evolution or learning of language per se, but rather to the statistical structure of the visual world that is reflected in natural language.
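As a point of reference, comparisons like the one described above typically use the encoding-model approach: extract features for each image (or its caption) from a candidate model, fit a cross-validated linear mapping from those features to the recorded neural responses, and score each model by its held-out prediction accuracy. The sketch below illustrates that generic recipe only; it is not the paper's exact pipeline, and all feature matrices, response arrays, shapes, and hyperparameters are placeholder assumptions.

```python
# Minimal sketch of a generic encoding-model comparison (assumed, not the
# paper's exact method): ridge-regress model features onto image-evoked
# neural responses and compare held-out prediction accuracy across models.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Placeholder feature matrices (n_images x n_features), standing in for
# activations from e.g. a pure-vision network, a language model applied to
# captions, or a multimodal model such as CLIP. Shapes are illustrative.
features = {
    "vision_only": rng.standard_normal((1000, 512)),
    "language_only": rng.standard_normal((1000, 768)),
    "vision_language": rng.standard_normal((1000, 512)),
}

# Placeholder neural responses (n_images x n_sites), e.g. fMRI voxels in
# human ventral visual cortex or macaque IT recording sites.
responses = rng.standard_normal((1000, 200))

def encoding_score(X, Y, n_splits=5, alphas=np.logspace(-1, 5, 7)):
    """Mean held-out Pearson r across sites from K-fold ridge regression."""
    fold_scores = []
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(X):
        model = RidgeCV(alphas=alphas).fit(X[train], Y[train])
        pred = model.predict(X[test])
        # Correlate predicted and observed responses at each site.
        r = [np.corrcoef(pred[:, i], Y[test, i])[0, 1] for i in range(Y.shape[1])]
        fold_scores.append(np.nanmean(r))
    return float(np.mean(fold_scores))

for name, X in features.items():
    print(f"{name}: mean held-out r = {encoding_score(X, responses):.3f}")
```

With real features and responses, similar scores for the language-only and vision-only models in a species without language would support the paper's interpretation that language model predictivity reflects shared statistical structure rather than language itself.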
