APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency
Authors: Yupu Yao, Shangqi Deng, Zihan Cao, Harry Zhang, Liang-Jian Deng
What
This paper introduces APLA, a novel text-to-video generation network structure based on diffusion models that adds a compact auxiliary network, the Video Generation Transformer (VGT), to improve the consistency of generated videos by extracting and exploiting information inherent to the input video.
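The summary does not include code, so as a rough illustration of the core idea, the PyTorch sketch below shows how a compact auxiliary network (a stand-in for the pure Transformer-decoder VGT variant) could turn tokens derived from the input video into an additive perturbation on the noise predicted by a frozen, pre-trained diffusion backbone. All module names, dimensions, and the perturbation scale are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class LatentPerturber(nn.Module):
    """Hypothetical stand-in for the pure Transformer-decoder VGT variant:
    maps tokens derived from the input video's latents to an additive
    perturbation on the noise predicted by a frozen diffusion backbone."""

    def __init__(self, latent_dim: int = 320, num_heads: int = 8, depth: int = 2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(
            d_model=latent_dim, nhead=num_heads, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)

    def forward(self, video_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (batch, sequence, latent_dim), e.g. per-frame latent
        # features of the single input video. Using the same sequence as both
        # target and memory lets the network attend to information the video
        # already contains.
        return self.decoder(video_tokens, video_tokens)


def refine_noise(predicted_noise: torch.Tensor,
                 perturbation: torch.Tensor,
                 scale: float = 0.1) -> torch.Tensor:
    """Add a small learned perturbation to the backbone's predicted noise.
    The additive form and the scale are assumptions for illustration."""
    return predicted_noise + scale * perturbation
```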
Why
This paper addresses a key limitation of existing video diffusion models: maintaining consistency across frames, particularly in retaining local details. It proposes combining VGT with adversarial training to improve the temporal coherence and overall quality of generated videos, a step towards high-fidelity video generation.
How
The authors propose APLA, which adds VGT on top of a pre-trained diffusion model. VGT, designed in two variants (a pure Transformer decoder and a hybrid with 3D convolution), extracts inherent information from the input video. They introduce a hyper-loss combining MSE, L1, and perceptual terms to better fit the latent noise, and incorporate adversarial training with a 1×1 convolutional discriminator to improve the robustness and quality of the generated videos. Experiments on the DAVIS dataset compare APLA with existing methods using CLIP score and FCI, and ablation studies evaluate the contribution of each component of APLA.
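Neither the exact loss weights nor the discriminator layout are given in this summary, so the PyTorch sketch below only illustrates one plausible form of the combined MSE + L1 + perceptual loss and of a 1×1-convolution discriminator; the weights (w_mse, w_l1, w_perc), the VGG-16 feature depth, and the hidden width are arbitrary placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16


class HyperLoss(nn.Module):
    """Sketch of a combined MSE + L1 + perceptual objective. The weights and
    the VGG-16 feature depth are placeholders, not values from the paper."""

    def __init__(self, w_mse: float = 1.0, w_l1: float = 1.0, w_perc: float = 0.1):
        super().__init__()
        self.w_mse, self.w_l1, self.w_perc = w_mse, w_l1, w_perc
        # Frozen VGG-16 features as a generic perceptual metric (assumed choice;
        # pass pretrained weights to use it as an actual perceptual loss).
        self.features = vgg16(weights=None).features[:16].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # pred/target: (batch, 3, H, W) tensors, e.g. decoded frames. Whether
        # the perceptual term is applied in pixel or latent space is not
        # specified in the summary above.
        mse = F.mse_loss(pred, target)
        l1 = F.l1_loss(pred, target)
        perc = F.mse_loss(self.features(pred), self.features(target))
        return self.w_mse * mse + self.w_l1 * l1 + self.w_perc * perc


class PixelDiscriminator(nn.Module):
    """1x1-convolution discriminator: every spatial position is scored
    independently as real or fake, matching the 1x1 kernel mentioned above."""

    def __init__(self, in_channels: int = 3, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # per-pixel real/fake logits
```

In a full training loop, the generator objective would presumably pair HyperLoss with a standard GAN loss against PixelDiscriminator (e.g., binary cross-entropy on its per-pixel logits); that pairing is an assumption consistent with, but not dictated by, the summary above.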
Result
APLA generates more consistent, higher-quality videos than existing methods, with a notable improvement in retaining local details across frames, a key limitation of previous diffusion models. Quantitative evaluations using CLIP score and FCI confirm the improved content and frame consistency, with APLA achieving state-of-the-art results. Ablation studies show that each component contributes to the overall performance, with the full combination of VGT, the hyper-loss, and adversarial training achieving the best results.
Limitations & Future Work
The authors acknowledge the computational cost of APLA, which requires more inference time than some existing methods. For future work, exploring more efficient VGT architectures to reduce computational complexity is suggested. Investigating APLA's generalization to a wider range of datasets and its application to other video generation tasks, such as video prediction or video editing, are also promising directions.
Abstract
Diffusion models have exhibited promising progress in video generation. However, they often struggle to retain consistent details within local regions across frames. One underlying cause is that traditional diffusion models approximate the Gaussian noise distribution by utilizing predicted noise, without fully accounting for the inherent information within the input itself. Additionally, these models emphasize the distinction between predictions and references, neglecting information intrinsic to the videos. To address these limitations, inspired by the self-attention mechanism, we propose a novel text-to-video (T2V) generation network structure based on diffusion models, dubbed Additional Perturbation for Latent noise with Adversarial training (APLA). Our approach requires only a single video as input and builds upon pre-trained stable diffusion networks. Notably, we introduce an additional compact network, known as the Video Generation Transformer (VGT). This auxiliary component is designed to extract perturbations from the inherent information contained within the input, thereby refining inconsistent pixels during temporal predictions. We leverage a hybrid architecture of transformers and convolutions to compensate for temporal intricacies, enhancing consistency between different frames within the video. Experiments demonstrate a noticeable improvement in the consistency of the generated videos, both qualitatively and quantitatively.
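As an illustration of the "hybrid architecture of transformers and convolutions" mentioned in the abstract, the sketch below combines a 3D convolution over the spatio-temporal volume with a Transformer decoder over per-frame tokens. It is a minimal, assumed layout; channel counts, depth, and the token arrangement are not taken from the paper.

```python
import torch
import torch.nn as nn


class HybridTemporalBlock(nn.Module):
    """Illustrative hybrid block: a 3D convolution captures local
    spatio-temporal detail, then a Transformer decoder mixes information
    across frames. Sizes are assumptions, not the paper's values."""

    def __init__(self, channels: int = 64, num_heads: int = 4):
        super().__init__()
        self.conv3d = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        layer = nn.TransformerDecoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape
        x = self.conv3d(x)
        # Flatten each frame into tokens so attention can relate positions
        # across the whole clip, then restore the original layout.
        tokens = x.permute(0, 2, 3, 4, 1).reshape(b, t * h * w, c)
        tokens = self.decoder(tokens, tokens)
        return tokens.reshape(b, t, h, w, c).permute(0, 4, 1, 2, 3)


if __name__ == "__main__":
    clip = torch.randn(1, 64, 4, 8, 8)   # 4 frames of 8x8 feature maps
    out = HybridTemporalBlock()(clip)
    print(out.shape)                      # torch.Size([1, 64, 4, 8, 8])
```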