BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages

Kavli Affiliate: Yi Zhou

| First 5 Authors: Junho Myung, Nayeon Lee, Yi Zhou, Jiho Jin, Rifki Afina Putri

| Summary:

Large language models (LLMs) often lack culture-specific knowledge of daily
life, especially across diverse regions and non-English languages. Existing
benchmarks for evaluating LLMs’ cultural sensitivities are limited to a single
language or collected from online sources such as Wikipedia, which do not
reflect the mundane everyday lifestyles of diverse regions. That is,
information about the food people eat for their birthday celebrations, spices
they typically use, musical instruments youngsters play, or the sports they
practice in school is common cultural knowledge but uncommon in easily
collected online sources, especially for underrepresented cultures. To address
this issue, we introduce BLEnD, a hand-crafted benchmark designed to evaluate
LLMs’ everyday knowledge across diverse cultures and languages. BLEnD comprises
52.6k question-answer pairs from 16 countries/regions, in 13 different
languages, including low-resource ones such as Amharic, Assamese, Azerbaijani,
Hausa, and Sundanese. We construct the benchmark to include two formats of
questions: short-answer and multiple-choice. We show that LLMs perform better
for cultures that are highly represented online, with a performance gap of up
to 57.34% for GPT-4, the best-performing model, in the short-answer format. For
cultures represented by mid-to-high-resource languages, LLMs perform better in
their local languages, whereas for cultures represented by low-resource
languages, they perform better in English than in the local languages. We make our dataset
publicly available at: https://github.com/nlee0212/BLEnD.
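
To make the two question formats concrete, here is a minimal sketch of what a BLEnD question-answer pair might look like in each format, together with a toy exact-match scorer for short answers. The field names, example values, and scoring logic are illustrative assumptions, not the dataset's actual schema or the paper's evaluation protocol; consult the GitHub repository for the real format.

```python
# Hypothetical sketch of BLEnD's two question formats. Field names and
# example values are assumptions for illustration only; the actual schema
# is documented at https://github.com/nlee0212/BLEnD.

short_answer_example = {
    "country": "Indonesia",
    "language": "Sundanese",  # one of the 13 languages, incl. low-resource ones
    "question": "What spice do people typically use in everyday cooking?",
    "answers": ["<local answer>", "<accepted variant>"],  # placeholder gold answers
}

multiple_choice_example = {
    "country": "Ethiopia",
    "language": "Amharic",
    "question": "What food do people eat for their birthday celebrations?",
    "choices": {"A": "<option 1>", "B": "<option 2>", "C": "<option 3>", "D": "<option 4>"},
    "answer": "A",  # placeholder correct choice
}


def is_correct_short_answer(prediction: str, gold_answers: list[str]) -> bool:
    """Toy exact-match check (lowercased, stripped); the paper's actual
    scoring may differ, so treat this as an assumption for illustration."""
    return prediction.strip().lower() in {a.strip().lower() for a in gold_answers}
```

For example, `is_correct_short_answer("Turmeric ", ["turmeric"])` returns `True`, since the comparison normalizes case and surrounding whitespace before matching against the set of accepted answers.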

