It’s easy to fool yourself: Case studies on identifying bias and confounding in bio-medical datasets

Kavli Affiliate: Michael Brenner

| First 5 Authors: Subhashini Venugopalan, Arunachalam Narayanaswamy, Samuel Yang, Anton Geraschenko, Scott Lipnick

| Summary:

Confounding variables are a well-known source of nuisance in biomedical
studies. They present an even greater challenge when we combine them with
black-box machine learning techniques that operate on raw data. This work
presents two case studies. In one, we discovered biases arising from systematic
errors in the data generation process. In the other, we found a spurious source
of signal unrelated to the prediction task at hand. In both cases, our
prediction models performed well, but careful examination revealed hidden
confounders and biases. These are cautionary tales on the limits
of using machine learning techniques on raw data from scientific experiments.

| Search Query: ArXiv Query: search_query=au:"Michael Brenner"&id_list=&start=0&max_results=3