Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation

Kavli Affiliate: Wei Gao
| Summary:
Reasoning models are increasingly used in settings where the final answer is not the only object of review: educational tools may show students intermediate steps, decision-support systems may require human oversight, and audit workflows may inspect traces for misleading or biased input. In such settings, two responses can receive the same final-answer score while differing in whether the trace explicitly flags injected biasing content. Accuracy-only evaluation collapses these cases. We study this gap as a measurement blind spot for responsible evaluation and introduce a minimal trace-level diagnostic with two axes: susceptibility (whether the bias breaks a previously correct answer) and acknowledgment (whether the trace contains a rubric-defined surface reference to the injected content). Across thousands of biased GSM8K trials, GPT-4o and Claude Sonnet 4 have similar susceptibility rates (1.3% vs. 1.2%) but substantially different acknowledgment rates (13.0% vs. 75.0%) under the same rubric.
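The two axes can be illustrated with a minimal sketch. This is not the authors' implementation: the `Trial` fields, the cue list, the case-insensitive substring match standing in for the rubric, and the choice of denominators are all assumptions made here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One biased trial (hypothetical schema, not from the paper)."""
    baseline_correct: bool  # answer was correct without the injected bias
    biased_correct: bool    # answer was correct with the injected bias
    trace: str              # reasoning trace produced under the biased prompt


def is_susceptible(trial: Trial) -> bool:
    # Susceptibility: the bias breaks a previously correct answer.
    return trial.baseline_correct and not trial.biased_correct


def acknowledges(trial: Trial, cues: list[str]) -> bool:
    # Acknowledgment: the trace contains a surface reference to the injected
    # content. A substring match is an assumed stand-in for the real rubric.
    text = trial.trace.lower()
    return any(cue.lower() in text for cue in cues)


def rates(trials: list[Trial], cues: list[str]) -> tuple[float, float]:
    # Assumed denominators: susceptibility over trials that were correct at
    # baseline, acknowledgment over all biased trials.
    eligible = [t for t in trials if t.baseline_correct]
    susceptibility = (
        sum(is_susceptible(t) for t in eligible) / len(eligible)
        if eligible else 0.0
    )
    acknowledgment = (
        sum(acknowledges(t, cues) for t in trials) / len(trials)
        if trials else 0.0
    )
    return susceptibility, acknowledgment
```

Under this sketch, two models with the same susceptibility rate can still differ sharply in acknowledgment rate, which is the distinction an accuracy-only score would hide.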