I2V3D: Controllable image-to-video generation with 3D guidance

City University of Hong Kong, Microsoft GenAI

Starting from a single image, our method reconstructs the complete scene geometry and uses a CG pipeline to enable precise control of character animation (e.g., keyframe animation or skeleton control) and camera movement. We then apply geometric guidance, derived from the coarse renderings, to generate high-quality, controllable videos.

Abstract

We present I2V3D, a novel framework for animating static images into dynamic videos with precise 3D control, leveraging the strengths of both 3D geometry guidance and advanced generative models. Our approach combines the precision of a computer graphics pipeline, enabling accurate control over elements such as camera movement, object rotation, and character animation, with the visual fidelity of generative AI to produce high-quality videos from coarsely rendered inputs. To support animations from any starting point and of extended length, we adopt a two-stage generation process guided by 3D geometry: 1) 3D-Guided Keyframe Generation, where a customized image diffusion model refines rendered keyframes to ensure consistency and quality, and 2) 3D-Guided Video Interpolation, a training-free approach that generates smooth, high-quality video frames between keyframes using bidirectional guidance. Experimental results highlight the effectiveness of our framework in producing controllable, high-quality animations from single input images by harmonizing 3D geometry with generative models. The code for our framework will be publicly released.

Methods

Our framework consists of three parts. First, we extract meshes from a single input image and use a 3D engine to create and preview a coarse animation. Next, we generate keyframes through a 3D-guided process with an image diffusion model customized for the input image, incorporating multi-view augmentation and extended attention. Finally, we perform 3D-guided interpolation between the generated keyframes to produce a high-quality, consistent video.
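The sketch below illustrates how these three stages compose, purely for orientation. All helper callables (reconstruct_scene, render_coarse, refine_keyframes, interpolate) are hypothetical stand-ins for the actual reconstruction, CG rendering, customized diffusion, and training-free interpolation components; only the stage ordering and the use of coarse renderings as guidance come from the description above.

```python
"""High-level sketch of the three-stage pipeline (illustrative only)."""
from typing import Callable, List, Sequence

import numpy as np

Image = np.ndarray          # an H x W x 3 RGB frame
Mesh = object               # placeholder for a reconstructed scene mesh


def animate_image(
    input_image: Image,
    animation_spec: dict,
    reconstruct_scene: Callable[[Image], Mesh],
    render_coarse: Callable[[Mesh, dict], List[Image]],
    refine_keyframes: Callable[[List[Image], Image], List[Image]],
    interpolate: Callable[[Image, Image, Sequence[Image]], List[Image]],
    keyframe_stride: int = 8,
) -> List[Image]:
    """Compose the stages: reconstruct -> 3D-guided keyframes -> interpolation."""
    # Stage 1: reconstruct scene geometry and preview a coarse CG animation.
    mesh = reconstruct_scene(input_image)
    coarse_frames = render_coarse(mesh, animation_spec)

    # Stage 2: 3D-guided keyframe generation. A customized image diffusion
    # model (with multi-view augmentation and extended attention in the paper)
    # refines a sparse set of coarse renderings into consistent keyframes.
    coarse_keyframes = coarse_frames[::keyframe_stride]
    keyframes = refine_keyframes(coarse_keyframes, input_image)

    # Stage 3: training-free, 3D-guided interpolation between adjacent
    # keyframes, guided by the intermediate coarse renderings.
    video: List[Image] = []
    for i in range(len(keyframes) - 1):
        guidance = coarse_frames[i * keyframe_stride : (i + 1) * keyframe_stride + 1]
        video.extend(interpolate(keyframes[i], keyframes[i + 1], guidance)[:-1])
    video.append(keyframes[-1])
    return video
```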

Results

Comparisons with other baselines

Qualitative comparison with baselines: (1st panel) human-like characters, (2nd panel) non-human objects. For human-like characters, MagicPose struggles with pose control (blue), and AnimateAnyone fails to preserve appearance (red). For non-human objects, MotionBooth shows overfitting (blue), and DragAnything shows error accumulation (red). ISculpting exhibits frame inconsistency (yellow) for both categories. Our method outperforms these baselines: it follows the geometric guidance of the coarse renderings while avoiding their artifacts (pink).

Ablation studies

We conduct ablation studies on: (1) LoRA customization; (2) extended attention; (3) 3D-guided video interpolation; (4) single-stage vs. two-stage video generation; (5) 3D reconstruction for the background; (6) 3D-guided video generation vs. video refinement.

Applications

In addition to object animation and camera movement, users can further add, copy, replace, and edit 3D objects to compose a new 3D scene that guides video generation.

Users can adjust the intensity of the 3D guidance (by modifying the feature injection and ControlNet scales) to rely more on the video generation model's motion priors and increase dynamism.
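A minimal sketch of what such a control knob might look like, assuming a simple configuration object; the field names (feature_injection_scale, injection_ratio, controlnet_scale) are assumptions for illustration, and only the idea of scaling feature injection and the ControlNet conditioning weight comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class GuidanceConfig:
    # Weight on features injected from the coarse renderings; lower values
    # let the generative model deviate more from the CG preview.
    feature_injection_scale: float = 1.0
    # Fraction of denoising steps during which injection is applied.
    injection_ratio: float = 0.6
    # Conditioning scale on the ControlNet fed with rendered geometry maps.
    controlnet_scale: float = 1.0


# Strong guidance: the output closely follows the coarse rendering.
strict = GuidanceConfig(feature_injection_scale=1.0, injection_ratio=0.8, controlnet_scale=1.0)

# Relaxed guidance: more of the video model's own motion priors, more dynamism.
dynamic = GuidanceConfig(feature_injection_scale=0.5, injection_ratio=0.4, controlnet_scale=0.6)
```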

Background reconstruction facilitates large camera movement in video generation, as demonstrated in the examples above.

BibTeX