Identifying and Solving Conditional Image Leakage in Image-to-Video Generation

Abstract

Diffusion models have made substantial progress in image-to-video (I2V) generation. However, such models are not fully understood. In this paper, we report a significant but previously overlooked issue in I2V diffusion models (I2V-DMs), namely, conditional image leakage. I2V-DMs tend to over-rely on the conditional image at large time steps, neglecting the crucial task of predicting the clean video from noisy inputs, which results in videos that lack dynamic and vivid motion. We address this challenge from both the inference and the training side with plug-and-play strategies. First, we introduce a training-free inference strategy that starts the generation process from an earlier time step to avoid the unreliable late time steps of I2V-DMs, together with an initial noise distribution with optimal analytic expressions (Analytic-Init), obtained by minimizing the KL divergence between it and the actual marginal distribution, to effectively bridge the training-inference gap. Second, to mitigate conditional image leakage during training, we design a time-dependent noise distribution for the conditional image, which favors high noise levels at large time steps to sufficiently interfere with the conditional image. We validate these strategies on various I2V-DMs using our collected open-domain image benchmark and the UCF101 dataset. Extensive results demonstrate that our methods outperform baselines by producing videos with more dynamic and natural motion without compromising image alignment and temporal consistency.
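The Analytic-Init idea can be illustrated with a minimal sketch. It rests on assumptions not stated on this page: that the forward process has the form x_M = alpha_M * x_0 + sigma_M * eps, that the initial noise is an isotropic Gaussian, and that the KL divergence is taken from the actual marginal to that Gaussian. Under those assumptions the minimizer is given by moment matching. The function names and parameters below are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def analytic_init_params(x0_samples, alpha_M, sigma_M):
    """Moment-matched isotropic Gaussian for x_M = alpha_M * x0 + sigma_M * eps.

    Assumption: minimizing KL(q_M || N(mean, var * I)) over mean and var,
    which yields mean = E[x_M] and var = E||x_M - mean||^2 / d.
    """
    x0 = np.asarray(x0_samples, dtype=np.float64)
    x0 = x0.reshape(x0.shape[0], -1)          # flatten each sample to a vector
    d = x0.shape[1]
    mu0 = x0.mean(axis=0)                      # empirical data mean
    # Average per-dimension data variance, i.e. Tr(Cov[x0]) / d.
    avg_var = ((x0 - mu0) ** 2).sum(axis=1).mean() / d
    mean = alpha_M * mu0
    var = alpha_M ** 2 * avg_var + sigma_M ** 2
    return mean, var


def sample_analytic_init(mean, var, n, rng):
    """Draw n starting points for the reverse process at time step M."""
    return mean + np.sqrt(var) * rng.standard_normal((n, mean.shape[0]))
```

In this sketch the data statistics would be estimated once from training videos (or their latents) and reused at inference; starting from the earlier time step M then amounts to drawing the initial sample from this Gaussian rather than from a standard normal.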

Conditional Image Leakage

Ideally, I2V-DMs predict clean videos from noisy inputs, using the conditional image as auxiliary content guidance. However, at large time steps, the heavily corrupted input retains minimal video detail, causing the model to over-rely on the detailed conditional image and neglect the crucial task of synthesizing video from noisy inputs.
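The training-side remedy named in the abstract, a time-dependent noise distribution for the conditional image, can be sketched as follows. This is only an illustration of the general idea. The logistic-normal form, the linear location schedule `4 * t - 2`, and the names `sample_condition_noise_level`, `noise_conditional_image`, `beta_max`, and `spread` are assumptions made here and may differ from the distribution the authors actually use.

```python
import numpy as np


def sample_condition_noise_level(t, rng, beta_max=1.0, spread=1.0, size=None):
    """Sample a noise level in (0, beta_max) for the conditional image.

    t is the normalized diffusion time in [0, 1]. The location of the
    underlying normal grows with t (hypothetical schedule), so larger
    time steps favor noise levels closer to beta_max.
    """
    loc = 4.0 * t - 2.0
    z = loc + spread * rng.standard_normal(size)
    return beta_max / (1.0 + np.exp(-z))       # squash into (0, beta_max)


def noise_conditional_image(image, t, rng, beta_max=1.0):
    """Perturb the conditional image with a time-dependent noise level."""
    beta = sample_condition_noise_level(t, rng, beta_max)
    noisy = image + beta * rng.standard_normal(image.shape)
    return noisy, beta
```

During training, the noised image would replace the clean conditional image as the model's condition, so that at large time steps the model can no longer copy fine detail from it and has to rely more on the noisy video input.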

Identifying Conditional Image Leakage in I2V-DMs

[Figure: conditional image, ground-truth video, and X₀ prediction of DynamiCrafter at M = 0.7T, 0.95T, and T. Predicted motion: ≈ GT motion at M = 0.7T, < GT motion at M = 0.95T, << GT motion at M = T.]

[Figure: conditional image, ground-truth video, and X₀ prediction of DynamiCrafter-CIL at M = 0.7T, 0.95T, and T. Predicted motion: ≈ GT motion at all three time steps.]

BibTeX

@article{zhao2024identifying,
  title={Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model},
  author={Zhao, Min and Zhu, Hongzhou and Xiang, Chendong and Zheng, Kaiwen and Li, Chongxuan and Zhu, Jun},
  journal={arXiv preprint arXiv:2406.15735},
  year={2024}
}
