Rhythmic Foley: A Framework For Seamless Audio-Visual Alignment In Video-to-Audio Synthesis

Kavli Affiliate: Dan Luo

| First 5 Authors: Zhiqi Huang, Dan Luo, Jun Wang, Huan Liao, Zhiheng Li

| Summary:

We introduce a framework for video-to-audio synthesis that addresses two
problems: audio-video desynchronization and semantic loss in the generated
audio. By incorporating a semantic alignment adapter and a temporal
synchronization adapter, our method significantly improves semantic integrity
and the precision of beat point synchronization, particularly in fast-paced
action sequences. Our model uses a contrastive audio-visual pre-trained encoder
and is trained on video and high-quality audio data, which improves the quality
of the generated audio. The dual-adapter approach gives users finer control
over audio semantics and beat effects: they can adjust the controller to
achieve better results. Extensive experiments substantiate the effectiveness of
our framework in achieving seamless audio-visual alignment.

| Search Query: ArXiv Query: search_query=au:"Dan Luo"&id_list=&start=0&max_results=3