Single and Multi-Hop Question-Answering Datasets for Reticular Chemistry with GPT-4-Turbo

Kavli Affiliate: Omar M. Yaghi

| First 5 Authors: Nakul Rampal, Kaiyu Wang, Matthew Burigana, Lingxiang Hou, Juri Al-Johani

| Summary:

The rapid advancement in artificial intelligence and natural language
processing has led to the development of large-scale datasets aimed at
benchmarking the performance of machine learning models. Herein, we introduce
‘RetChemQA,’ a comprehensive benchmark dataset designed to evaluate the
capabilities of such models in the domain of reticular chemistry. This dataset
includes both single-hop and multi-hop question-answer pairs, encompassing
approximately 45,000 Q&As for each type. The questions have been extracted from
an extensive corpus of literature containing about 2,530 research papers from
publishers including NAS, ACS, RSC, Elsevier, and Nature Publishing Group,
among others. The dataset has been generated using OpenAI’s GPT-4 Turbo, a
cutting-edge model known for its exceptional language understanding and
generation capabilities. In addition to the Q&A dataset, we also release a
dataset of synthesis conditions extracted from the corpus of literature used in
this study. The aim of RetChemQA is to provide a robust platform for the
development and evaluation of advanced machine learning algorithms,
particularly for the reticular chemistry community. The dataset is structured
to reflect the complexities of real-world scientific discourse, thereby
enabling nuanced performance assessments across a variety of tasks. The
dataset is available at the following link:
https://github.com/nakulrampal/RetChemQA

| Search Query: ArXiv Query: search_query=au:"Omar M. Yaghi"&id_list=&start=0&max_results=3