CoDeF: Content Deformation Fields for Temporally Consistent Video Processing

Authors: Hao Ouyang, Qiuyu Wang, Yuxi Xiao, Qingyan Bai, Juntao Zhang, Kecheng Zheng, Xiaowei Zhou, Qifeng Chen, Yujun Shen

What

This paper introduces Content Deformation Fields (CoDeF), a novel video representation comprising a canonical content field, which aggregates the static content of the entire video, and a temporal deformation field, which records the transformation from the canonical image to each frame. This representation makes it possible to apply image algorithms to videos while preserving temporal consistency.
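In symbols (a schematic formulation under our reading; the notation below is ours, not necessarily the paper's), each frame is reconstructed by composing the two fields: the deformation field warps a pixel coordinate and timestamp into the canonical image, and the canonical field renders the color there.

```latex
% Schematic formulation (our notation): a pixel x of frame I_t is
% rendered by warping into the canonical image C and sampling it.
\[
  I_t(\mathbf{x}) \;\approx\; \mathcal{C}\big(\mathcal{D}(\mathbf{x}, t)\big),
  \qquad \mathcal{D}: (\mathbf{x}, t) \mapsto \mathbf{x}',
  \qquad \mathcal{C}: \mathbf{x}' \mapsto \text{RGB}.
\]
```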

Why

The work bridges the gap between advanced image algorithms and video processing: because an algorithm needs to run on only a single canonical image, its output can be propagated to every frame, yielding temporally consistent video editing and manipulation that surpasses previous techniques in both quality and efficiency.

How

The authors represent the canonical content with a 2D hash-based image field and the temporal deformation with a 3D hash-based field; the two are jointly optimized to reconstruct the input video through a carefully tailored rendering pipeline. To keep the canonical image semantically meaningful and the deformations smooth, they introduce techniques such as annealed hash encoding and a flow-guided consistency loss. The system is evaluated on video reconstruction, video-to-video translation, keypoint tracking, object tracking, and super-resolution.
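A minimal PyTorch sketch of the two-field design is below. It is illustrative only: plain Fourier-feature MLPs stand in for the paper's multiresolution hash encodings, the annealing schedule and flow-guided consistency loss are omitted, and all names (FourierMLP, reconstruct, etc.) are ours.

```python
# Minimal sketch of CoDeF's two-field design (illustrative only).
# Assumptions: Fourier-feature MLPs replace the paper's hash encodings;
# only the reconstruction loss is shown.
import torch
import torch.nn as nn

class FourierMLP(nn.Module):
    """Coordinate MLP with Fourier features (hash-encoding stand-in)."""
    def __init__(self, in_dim, out_dim, n_freqs=8, hidden=128):
        super().__init__()
        self.n_freqs = n_freqs
        feat_dim = in_dim * 2 * n_freqs
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        freqs = 2.0 ** torch.arange(self.n_freqs, device=x.device)
        ang = x[..., None] * freqs                    # (..., in_dim, n_freqs)
        feat = torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(-2)
        return self.net(feat)

# Deformation field D: (x, y, t) -> (dx, dy); canonical field C: (x', y') -> RGB.
deform = FourierMLP(in_dim=3, out_dim=2)
canonical = FourierMLP(in_dim=2, out_dim=3)
opt = torch.optim.Adam(
    list(deform.parameters()) + list(canonical.parameters()), lr=1e-3)

def reconstruct(xy, t):
    """Render frame-t colors at pixel coords xy by warping into the canonical image."""
    offset = deform(torch.cat([xy, t], dim=-1))
    return torch.sigmoid(canonical(xy + offset))

# One optimization step on a random batch of (pixel, time, color) samples.
xy = torch.rand(1024, 2)          # normalized pixel coordinates
t = torch.rand(1024, 1)           # normalized timestamps
target_rgb = torch.rand(1024, 3)  # would come from the input video
loss = ((reconstruct(xy, t) - target_rgb) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()
```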

Result

CoDeF achieves superior video reconstruction quality with a 4.4 dB higher PSNR than Neural Image Atlas and significantly faster training (5 minutes vs. 10 hours). It effectively lifts image algorithms to video tasks, demonstrating superior temporal consistency in video-to-video translation, keypoint tracking on non-rigid objects, and object tracking compared to previous methods.

Limitations & Future Work

The paper acknowledges limitations regarding per-scene optimization, challenges with extreme viewpoint changes, and handling large non-rigid deformations. Future work may explore feed-forward implicit field techniques, 3D prior knowledge integration, and using multiple canonical images to address these limitations.

Abstract

We present the content deformation field (CoDeF) as a new type of video representation, which consists of a canonical content field aggregating the static contents in the entire video and a temporal deformation field recording the transformations from the canonical image (i.e., rendered from the canonical content field) to each individual frame along the time axis. Given a target video, these two fields are jointly optimized to reconstruct it through a carefully tailored rendering pipeline. We advisedly introduce some regularizations into the optimization process, urging the canonical content field to inherit semantics (e.g., the object shape) from the video. With such a design, CoDeF naturally supports lifting image algorithms for video processing, in the sense that one can apply an image algorithm to the canonical image and effortlessly propagate the outcomes to the entire video with the aid of the temporal deformation field. We experimentally show that CoDeF is able to lift image-to-image translation to video-to-video translation and lift keypoint detection to keypoint tracking without any training. More importantly, thanks to our lifting strategy that deploys the algorithms on only one image, we achieve superior cross-frame consistency in processed videos compared to existing video-to-video translation approaches, and even manage to track non-rigid objects like water and smog. Project page can be found at https://qiuyu96.github.io/CoDeF/.
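As a usage note, the lifting workflow the abstract describes can be sketched as follows. Here deform is assumed to be a trained deformation field (as in the earlier sketch) mapping normalized frame coordinates and a timestamp to canonical coordinates in [-1, 1], and edit_image is a hypothetical stand-in for any off-the-shelf image algorithm (stylization, translation, etc.).

```python
# Sketch of lifting a single-image edit to the whole video (illustrative).
# `deform` and `edit_image` are assumptions, not the paper's exact API.
import torch
import torch.nn.functional as F

def lift_edit_to_video(canonical_rgb, deform, n_frames, h, w, edit_image):
    """Apply an image algorithm once to the canonical image, then warp it to every frame."""
    edited = edit_image(canonical_rgb)        # (1, 3, H, W), edited exactly once
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    xy = torch.stack([xs, ys], dim=-1).reshape(-1, 2)   # frame pixel grid
    frames = []
    for i in range(n_frames):
        t = torch.full((xy.shape[0], 1), i / max(n_frames - 1, 1))
        # Warp frame coordinates into the canonical image...
        grid = (xy + deform(torch.cat([xy, t], dim=-1))).view(1, h, w, 2)
        # ...and sample the *edited* canonical image there.
        frames.append(F.grid_sample(edited, grid, align_corners=True))
    return torch.cat(frames, dim=0)           # (n_frames, 3, H, W)
```

Because the image algorithm touches only the canonical image, every frame inherits the same edit, which is the source of the cross-frame consistency the paper reports.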