Kavli Affiliate: Devanand Manoli and Christoph Kirst
| Authors: Shuyu Wang, Kara Quine, Audrey Jordan, Shreya Dasari, Devanand S. Manoli and Christoph Kirst
| Summary:
Tracking animal behavior in naturalistic settings is essential for understanding social dynamics and their neural underpinnings. Pose estimation methods can produce accurate keypoints using framewise inference. However, the post hoc tracking steps that link keypoints across frames often struggle to maintain consistent identities over time, particularly during close and rapid social interactions between visually similar animals. We present a bidirectional video object segmentation (VOS) pipeline that corrects identity swaps with substantially less manual annotation effort, addressing the prohibitive cost of identity correction in pose estimation data from long recordings and large cohorts. Our approach builds on Cutie, a state-of-the-art VOS algorithm that leverages both pixel- and object-level representations across multiple memory timescales. By comparing segmentation masks from independent forward and reverse inference runs, we identify localized zones of disagreement and flag them for manual review. Applied to more than 160 hours of dyadic vole interaction videos, our method reduces identity swaps by two orders of magnitude compared to typical pose estimation workflows and requires review of less than 0.3% of frames per video to achieve identity-error-free segmentation masks and aligned keypoints. The approach generalizes to social interactions involving three or more animals, with scalability constrained primarily by behavioral complexity (e.g., complete occlusion of multiple individuals). Our method enables scalable, long-term tracking of unmarked animals in group settings and provides a practical foundation for more naturalistic studies of social behavior. To lower the barrier for researchers facing similar tracking challenges, we provide an accessible graphical user interface for general use.
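The central idea is to compare the forward and reverse segmentation runs frame by frame and flag frames where the two disagree. Below is a minimal sketch of that comparison, assuming both runs have already been exported as aligned per-frame integer label maps (0 = background, 1..K = animal identities); the function names and the IoU threshold are illustrative and do not reflect the released implementation or Cutie's API.

```python
# Minimal sketch of bidirectional mask disagreement flagging (not the authors' code).
# Inputs are assumed to be per-frame integer label maps from independent
# forward and reverse VOS runs; the IoU threshold is an illustrative choice.
import numpy as np

def per_identity_iou(mask_a: np.ndarray, mask_b: np.ndarray, n_ids: int) -> np.ndarray:
    """IoU of each identity's mask between the two runs for a single frame."""
    ious = np.ones(n_ids)  # identities absent from both runs count as agreement
    for k in range(1, n_ids + 1):
        a, b = mask_a == k, mask_b == k
        union = np.logical_or(a, b).sum()
        if union > 0:
            ious[k - 1] = np.logical_and(a, b).sum() / union
    return ious

def flag_disagreements(forward_masks, reverse_masks, n_ids=2, iou_thresh=0.5):
    """Return indices of frames where any identity's forward/reverse masks diverge."""
    flagged = []
    for t, (f, r) in enumerate(zip(forward_masks, reverse_masks)):
        if per_identity_iou(f, r, n_ids).min() < iou_thresh:
            flagged.append(t)  # candidate identity swap -> queue for manual review
    return flagged
```

Contiguous runs of flagged frames can then be grouped into the localized zones of disagreement that are surfaced for manual review.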