Kavli Affiliate: Brian Caffo and Joshua Vogelstein
| Authors: Eric W. Bridgeford, Michael Powell, Gregory Kiar, Stephanie Noble, Jaewon Chung, Sambit Panda, Ross Lawrence, Ting Xu, Michael Milham, Brian Caffo and Joshua T. Vogelstein
| Summary:
Batch effects, undesirable sources of variance across multiple experiments, present significant challenges for scientific and clinical discoveries. Specifically, batch effects can (i) produce spurious signals and/or (ii) obscure genuine signals, contributing to the ongoing reproducibility crisis. Typically, batch effects are modeled as classical, rather than causal, statistical effects. This modeling choice leaves the methods unable to differentiate between biological and experimental sources of variability, leading to unnecessary false-positive and false-negative effect detections and to over-confidence. We formalize batch effects as causal effects to address these concerns, and augment existing batch effect detection and correction approaches with causal machinery. Simulations illustrate that our causal approaches mitigate spurious findings and reveal otherwise obscured signals as compared to non-causal approaches. Applying our causal methods to a large neuroimaging mega-study reveals instances where prior art confidently asserts that the data do not support the presence of batch effects when we expect to detect them. On the other hand, our causal methods correctly discern that there exists irreducible confounding in the data, so it is unclear whether differences are due to batches or not. This work therefore provides a framework for understanding the potential capabilities and limitations of analysis of multi-site data using causal machinery.
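To make the idea of a spurious batch signal concrete, here is a minimal simulation sketch. It is not the authors' method: it substitutes plain covariate adjustment by linear regression for their causal machinery, and every name and number in it (the covariate `age`, the sample size, the effect sizes) is a hypothetical choice made only for illustration. Two batches differ in their covariate distribution but have no batch effect at all; a naive comparison of batch means still reports a sizable difference, while adjusting for the covariate brings the estimate close to zero.

```python
import random
from statistics import mean


def simulate(n=2000, seed=0):
    """Simulate two batches whose covariate distributions differ.

    Assumption for illustration: the outcome depends on age only,
    so the true batch effect is exactly zero.
    """
    rng = random.Random(seed)
    batch, age, y = [], [], []
    for _ in range(n):
        b = rng.randint(0, 1)               # batch label, 0 or 1
        a = rng.gauss(30 + 10 * b, 5)       # batch 1 recruits older subjects
        batch.append(b)
        age.append(a)
        y.append(0.1 * a + rng.gauss(0, 1))  # no direct dependence on batch
    return batch, age, y


def naive_effect(batch, y):
    """Difference in mean outcome between batches, ignoring covariates."""
    y1 = [v for b, v in zip(batch, y) if b == 1]
    y0 = [v for b, v in zip(batch, y) if b == 0]
    return mean(y1) - mean(y0)


def residualize(target, covariate):
    """Residuals of target after a simple least-squares fit on covariate."""
    mt, mc = mean(target), mean(covariate)
    sxx = sum((c - mc) ** 2 for c in covariate)
    sxy = sum((c - mc) * (t - mt) for c, t in zip(covariate, target))
    slope = sxy / sxx
    return [(t - mt) - slope * (c - mc) for t, c in zip(target, covariate)]


def adjusted_effect(batch, age, y):
    """Batch coefficient in the regression y ~ 1 + batch + age.

    Computed by partialling age out of both batch and y, then regressing
    the residuals on each other (the Frisch-Waugh-Lovell identity).
    """
    rb = residualize(batch, age)
    ry = residualize(y, age)
    return sum(b * v for b, v in zip(rb, ry)) / sum(b * b for b in rb)


batch, age, y = simulate()
print("naive batch difference:   ", round(naive_effect(batch, y), 3))
print("age-adjusted batch effect:", round(adjusted_effect(batch, age, y), 3))
```

The adjustment works here only because the two batches overlap in age and the linear model is correct by construction. If the age ranges of the batches did not overlap, the adjusted estimate would rest entirely on model extrapolation, which is one simple way to picture the situation the summary calls irreducible confounding.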