Diffusion models have made substantial progress in image-to-video (I2V) generation, yet such models are not fully understood. In this paper, we report a significant but previously overlooked issue in I2V diffusion models (I2V-DMs): conditional image leakage. I2V-DMs tend to over-rely on the conditional image at large time steps, neglecting the crucial task of predicting the clean video from noisy inputs, which yields videos that lack dynamic and vivid motion. We address this challenge from both the inference and training sides with plug-and-play strategies. First, we introduce a training-free inference strategy that starts the generation process from an earlier time step to avoid the unreliable late time steps of I2V-DMs, together with an initial noise distribution with optimal analytic expressions (Analytic-Init), obtained by minimizing the KL divergence between it and the actual marginal distribution, to effectively bridge the training-inference gap. Second, to mitigate conditional image leakage during training, we design a time-dependent noise distribution for the conditional image that favors high noise levels at large time steps, sufficiently corrupting the conditional image. We validate these strategies on various I2V-DMs using our collected open-domain image benchmark and the UCF101 dataset. Extensive results demonstrate that our methods outperform the baselines, producing videos with more dynamic and natural motion without compromising image alignment or temporal consistency.
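The idea behind Analytic-Init can be illustrated with a small sketch. Assuming a forward process of the form x_t = alpha_t * x_0 + sigma_t * eps, the Gaussian (with diagonal covariance) that minimizes the KL divergence to the true marginal at the chosen start step t_M is obtained by moment matching. The function name, arguments, and the use of per-element moments here are illustrative assumptions, not the paper's exact expressions:

```python
import numpy as np

def analytic_init(x0_samples, alpha_tM, sigma_tM, num_samples, rng=None):
    """Sample initial noise from a Gaussian moment-matched to q(x_{t_M}).

    Hypothetical sketch: x0_samples is a batch of clean (latent) videos used
    to estimate the data moments; alpha_tM / sigma_tM are the noise-schedule
    coefficients at the earlier start step t_M.
    """
    rng = np.random.default_rng() if rng is None else rng
    # For x_{t_M} = alpha_{t_M} * x_0 + sigma_{t_M} * eps, the KL-minimizing
    # Gaussian with diagonal covariance matches the first two moments of the
    # marginal element-wise:
    mu = alpha_tM * x0_samples.mean(axis=0)
    var = alpha_tM ** 2 * x0_samples.var(axis=0) + sigma_tM ** 2
    return mu + np.sqrt(var) * rng.standard_normal((num_samples, *mu.shape))
```

Starting sampling from t_M < T with this initialization avoids the late time steps where leakage dominates, while keeping the sampler's input distribution close to what the model saw during training.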
Ideally, I2V-DMs predict the clean video from noisy inputs, using the conditional image only as auxiliary content guidance. At large time steps, however, the heavily corrupted input retains minimal video detail, so the model tends to over-rely on the detailed conditional image and neglects the crucial task of synthesizing the video from noisy inputs.
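The training-side fix can be sketched as follows: corrupt the conditional image with noise whose scale grows with the diffusion time step, so that at large t the model cannot simply copy the condition and must rely on the noisy video input. The schedule shape (a monomial in t/T) and the parameters `s_max` and `power` are illustrative assumptions, not the paper's exact noise distribution:

```python
import numpy as np

def corrupt_condition(cond_img, t, T, s_max=0.6, power=2.0, rng=None):
    """Add time-dependent Gaussian noise to the conditional image.

    Hypothetical sketch: the noise scale sigma_t increases with t, reaching
    s_max at t = T, so the condition is strongly interfered with exactly at
    the large time steps where leakage occurs.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma_t = s_max * (t / T) ** power          # grows monotonically with t
    noisy = cond_img + sigma_t * rng.standard_normal(cond_img.shape)
    return noisy, sigma_t
```

During training, the noisy condition replaces the clean one when forming the model input, leaving the rest of the training loop unchanged, which is what makes the strategy plug-and-play.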
This implementation is based on the following work:
- DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
- VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Thanks to the authors for sharing their code and models.
@article{zhao2024identifying,
  title={Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model},
  author={Zhao, Min and Zhu, Hongzhou and Xiang, Chendong and Zheng, Kaiwen and Li, Chongxuan and Zhu, Jun},
  journal={arXiv preprint arXiv:2406.15735},
  year={2024}
}