Identifying and Solving Conditional Image Leakage in Image-to-Video Generation

1 Dept. of Comp. Sci. & Tech., BNRist Center, THU-Bosch ML Center, Tsinghua University
2 Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing, China
3 ShengShu, Beijing, China
4 Pazhou Laboratory (Huangpu), Guangzhou, China

Abstract

Diffusion models have made substantial progress in image-to-video (I2V) generation, yet they are not fully understood. In this paper, we report a significant but previously overlooked issue in I2V diffusion models (I2V-DMs): conditional image leakage. At large time steps, I2V-DMs tend to over-rely on the conditional image and neglect the crucial task of predicting the clean video from noisy inputs, which results in videos that lack dynamic and vivid motion. We address this challenge at both inference and training time with plug-and-play strategies. First, we introduce a training-free inference strategy that starts the generation process from an earlier time step to avoid the unreliable late time steps of I2V-DMs, together with an initial noise distribution with optimal analytic expressions (Analytic-Init), obtained by minimizing the KL divergence between it and the actual marginal distribution, to bridge the training-inference gap. Second, to mitigate conditional image leakage during training, we design a time-dependent noise distribution for the conditional image that favors high noise levels at large time steps, so that the conditional image is sufficiently perturbed. We validate these strategies on various I2V-DMs using our collected open-domain image benchmark and the UCF101 dataset. Extensive results demonstrate that our methods outperform baselines by producing videos with more dynamic and natural motion without compromising image alignment or temporal consistency.


Conditional Image Leakage

Our method.

Ideally, I2V-DMs predict clean videos from noisy inputs, using the conditional image as auxiliary content guidance. However, at large time steps, the heavily corrupted input retains minimal video detail, causing the model to over-rely on the detailed conditional image and neglect the crucial task of synthesizing video from noisy inputs.
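To make "heavily corrupted input" concrete, the minimal sketch below computes the signal-to-noise ratio (SNR) of the noisy input at a few time steps. The DDPM-style linear beta schedule, its endpoints, T = 1000, and the helper name snr_at are assumptions made for illustration only; the I2V-DMs discussed here may use different schedules.

```python
import numpy as np

def snr_at(fractions, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Return the SNR alpha_bar / (1 - alpha_bar) at t = fraction * T.

    Assumes a DDPM-style linear beta schedule (illustrative choice only).
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    # alpha_bar[t] is the fraction of clean-signal variance kept at step t.
    alpha_bar = np.cumprod(1.0 - betas)
    # Map each fraction of T to a 0-based step index.
    idx = np.round(np.asarray(fractions) * num_steps).astype(int) - 1
    idx = np.clip(idx, 0, num_steps - 1)
    return alpha_bar[idx] / (1.0 - alpha_bar[idx])

if __name__ == "__main__":
    fractions = (0.7, 0.95, 1.0)
    for frac, snr in zip(fractions, snr_at(fractions)):
        print(f"t = {frac:.2f} T   SNR = {snr:.2e}")
```

Under this assumed schedule the SNR shrinks steadily toward t = T, so the noisy input carries very little information about the video while the conditional image stays fully detailed.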

Identifying Conditional Image Leakage in I2V-DMs





Figure: conditional image, ground-truth (GT) video, and the X0 prediction of DynamiCrafter at M = 0.7T, 0.95T, and T. Predicted motion: ≈ GT motion at M = 0.7T, < GT motion at M = 0.95T, << GT motion at M = T.




Figure: conditional image, ground-truth (GT) video, and the X0 prediction of DynamiCrafter-CIL at M = 0.7T, 0.95T, and T. Predicted motion: ≈ GT motion at all three time steps.


Inference Strategy
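The inference strategy summarized in the abstract has two parts: start sampling at an earlier step M < T, and draw the initial sample from a Gaussian fitted to the true marginal at step M. The sketch below is one way this could look, under stated assumptions: the forward process is x_M = alpha_M * x_0 + sigma_M * eps, the initial distribution is an isotropic Gaussian, and the KL divergence is taken from the marginal to the Gaussian, in which case the minimizer is given by moment matching. The function names are hypothetical, and the paper's exact expressions may differ.

```python
import numpy as np

def analytic_init_params(x0_samples, alpha_m, sigma_m):
    """Moment-matched isotropic Gaussian for x_M = alpha_m * x0 + sigma_m * eps.

    x0_samples: array of clean samples, first axis is the sample axis.
    Returns (mu, var): per-dimension mean and a single shared variance.
    """
    x0 = np.asarray(x0_samples, dtype=np.float64)
    x0 = x0.reshape(x0.shape[0], -1)
    mean_x0 = x0.mean(axis=0)
    # Average per-dimension variance of the clean data: tr(Cov[x0]) / d.
    avg_var_x0 = x0.var(axis=0).mean()
    mu = alpha_m * mean_x0
    var = alpha_m ** 2 * avg_var_x0 + sigma_m ** 2
    return mu, var

def sample_initial_noise(mu, var, rng):
    """Draw one initial sample x_M ~ N(mu, var * I)."""
    return mu + np.sqrt(var) * rng.standard_normal(mu.shape)

def truncated_timesteps(start_step, num_inference_steps):
    """Evenly spaced integer steps from start_step (M < T) down to 0."""
    return np.linspace(start_step, 0, num_inference_steps).round().astype(int)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_latents = rng.normal(0.5, 2.0, size=(1000, 16))  # stand-in data
    mu, var = analytic_init_params(fake_latents, alpha_m=0.1, sigma_m=0.99)
    x_m = sample_initial_noise(mu, var, rng)
    print(x_m.shape, truncated_timesteps(950, 5))
```

A sampler would then run its usual denoising loop over truncated_timesteps, starting from x_m instead of a standard normal sample drawn at step T.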





Conditional Image DynamiCrafter + Our Inference Strategy

A kitten lying on the bed.


A duck swimming in the lake.


A girl walks up the steps of a palace.




Conditional Image VideoCrafter1 + Our Inference Strategy

The sun sets on the horizon, casting a golden glow over the turbulent sea.


A duck swimming in the lake.


A fawn Pembroke Welsh Corgi walking in slow-motion in Times Square, in cubist painting style.




Conditional Image SVD + Our Inference Strategy


Conditional Image DynamiCrafter-finetune + Our Inference Strategy

A duck swimming in the lake.


A soldier riding a horse.


Fireworks exploding in the sky.




Conditional Image VideoCrafter1-finetune + Our Inference Strategy

A cartoon girl with brown curly hair splashes joyfully in a bubble-filled bathtub.


A man riding motor on a mountain road.


A plate full of food, with camera spinning.




Conditional Image SVD-finetune + Our Inference Strategy

Training Strategy
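The training strategy summarized in the abstract perturbs the conditional image with a noise level drawn from a time-dependent distribution that favors high noise at large time steps. The sketch below illustrates the idea only: the logit-normal form, the linear schedule for its location, the additive way noise is injected, and all function and parameter names are assumptions, not the authors' exact design.

```python
import numpy as np

def sample_cond_noise_level(t, rng, max_level=1.0, spread=1.0):
    """Draw a noise level in (0, max_level) for normalized time t in [0, 1].

    The location of the underlying normal rises with t, so larger t
    makes high noise levels more likely (hypothetical schedule).
    """
    loc = 4.0 * t - 2.0
    z = rng.normal(loc, spread)
    return max_level / (1.0 + np.exp(-z))  # squash to (0, max_level)

def perturb_condition(cond_image, t, rng):
    """Add Gaussian noise to the conditional image at a t-dependent level."""
    level = sample_cond_noise_level(t, rng)
    noise = rng.standard_normal(cond_image.shape)
    return cond_image + level * noise, level

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = np.zeros((3, 8, 8))  # stand-in conditional image
    for t in (0.1, 0.5, 0.9):
        levels = [sample_cond_noise_level(t, rng) for _ in range(1000)]
        print(f"t = {t:.1f}  mean noise level = {np.mean(levels):.2f}")
    noisy, level = perturb_condition(image, 0.9, rng)
    print(noisy.shape)
```

During training, the diffusion time step for the video and the noise level for the conditional image would be sampled together for each example, so the model sees a weaker conditional signal exactly where leakage is most likely.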





Conditional Image DynamiCrafter-finetune + Our Training Strategy

Donkeys in traditional attire gallop across a lush green meadow.


Mountains under the starlight.


Rabbits playing in a river.




Conditional Image VideoCrafter1-finetune + Our Training Strategy

A couple hugging a cat.


Mystical hills with a glowing blue portal.


A woman with flowing, curly silver hair and dark eyes.




Conditional Image SVD-finetune + Our Training Strategy



BibTeX

@article{zhao2024Identifying,
      title={Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model},
      author={Min Zhao and Hongzhou Zhu and Chendong Xiang and Kaiwen Zheng and Chongxuan Li and Jun Zhu},
      journal={arXiv preprint arXiv:2406.15735},
      year={2024}
}