Our framework consists of three stages. First, we extract 3D meshes from a single input image and use a 3D engine to create and preview a coarse animation. Next, we generate keyframes via a 3D-guided process with an image diffusion model customized for the input image, incorporating multi-view augmentation and extended attention. Finally, we perform 3D-guided interpolation between the generated keyframes to produce a high-quality, temporally consistent video.
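The three stages can be summarized as the pipeline skeleton below. This is a minimal sketch with hypothetical function and class names, not the actual implementation; stage internals are left as placeholders.

```python
import numpy as np
from dataclasses import dataclass, field
from typing import List


@dataclass
class CoarseAnimation:
    """Output of stage 1: coarse per-frame renderings plus geometry guidance."""
    renderings: List[np.ndarray] = field(default_factory=list)
    depth_maps: List[np.ndarray] = field(default_factory=list)


def build_coarse_animation(image_path: str) -> CoarseAnimation:
    """Stage 1: extract 3D meshes from the input image and animate them in a
    3D engine to obtain coarse, previewable renderings."""
    raise NotImplementedError


def generate_keyframes(anim: CoarseAnimation, input_image: np.ndarray) -> List[np.ndarray]:
    """Stage 2: sample keyframes with an image diffusion model customized on the
    input image (multi-view augmentation + extended attention), guided by the
    coarse renderings."""
    raise NotImplementedError


def interpolate_video(keyframes: List[np.ndarray], anim: CoarseAnimation) -> List[np.ndarray]:
    """Stage 3: 3D-guided interpolation between the generated keyframes to
    produce the final consistent video."""
    raise NotImplementedError
```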
Qualitative comparison with baselines: (1st panel) human-like characters, (2nd panel) non-human objects. For human-like characters, MagicPose struggles with pose control (blue), and AnimateAnyone fails to preserve appearance (red). For non-human objects, MotionBooth shows overfitting (blue), and DragAnything shows error accumulation (red). ISculpting exhibits frame inconsistency (yellow) in both categories. Our method outperforms these baselines, following the geometry guidance of the coarse renderings while avoiding their artifacts (pink).
(1) LoRA customization; (2) Extended Attention; (3) 3D-guided video interpolation; (4) Single-stage vs. two-stage video generation; (5) 3D reconstruction for background; (6) 3D-guided video generation vs. video refinement.
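For reference, extended attention, item (2), is commonly implemented by letting each keyframe's queries attend to keys and values shared across all keyframes. The PyTorch sketch below illustrates that general idea only; it is not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F


def extended_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Cross-frame ("extended") attention sketch: queries of every keyframe attend
    to keys/values concatenated over all keyframes, which encourages a consistent
    appearance across the generated keyframes.

    q, k, v: (frames, heads, tokens, dim)
    """
    f, h, n, d = k.shape
    # Concatenate keys/values over the frame axis and share them with every frame.
    k_ext = k.permute(1, 0, 2, 3).reshape(1, h, f * n, d).expand(f, -1, -1, -1)
    v_ext = v.permute(1, 0, 2, 3).reshape(1, h, f * n, d).expand(f, -1, -1, -1)
    return F.scaled_dot_product_attention(q, k_ext, v_ext)


# Example: 4 keyframes, 8 heads, 256 tokens, 64-dim heads.
q = k = v = torch.randn(4, 8, 256, 64)
out = extended_attention(q, k, v)   # -> (4, 8, 256, 64)
```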
In addition to object animation and camera movement, users can add, copy, replace, and edit 3D objects to compose a new 3D scene that guides video generation.
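As an illustration of such edits, here is a minimal sketch using Blender's Python API (bpy). The choice of engine and the object names ("Chair", "Table") are assumptions for illustration, not details given in the paper.

```python
import bpy

src = bpy.data.objects["Chair"]          # hypothetical object reconstructed from the input image

# Copy: duplicate the object (sharing mesh data) and place it elsewhere in the scene.
dup = src.copy()
dup.location = (src.location.x + 1.5, src.location.y, src.location.z)
bpy.context.collection.objects.link(dup)

# Edit: adjust the pose and scale of an existing object.
src.rotation_euler[2] += 0.5             # rotate ~28.6 degrees around Z
src.scale = (1.2, 1.2, 1.2)

# Replace: swap one object's mesh data for another's.
donor = bpy.data.objects["Table"]        # hypothetical replacement mesh
dup.data = donor.data

# Render the edited scene as coarse guidance for video generation.
bpy.context.scene.render.filepath = "/tmp/coarse_guidance.png"
bpy.ops.render.render(write_still=True)
```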
Users can adjust the strength of the 3D guidance (by modifying the feature-injection and ControlNet scales) to rely more on the video generation model's motion priors and thereby increase dynamism.
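For instance, in a standard diffusers ControlNet pipeline the conditioning scale is a single argument. The snippet below is an illustration under that assumption: the model IDs and the depth render are placeholders, and the feature-injection scale is a method-specific knob not exposed by this public API.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

coarse_render_depth = load_image("coarse_render_depth.png")  # guidance rendered by the 3D engine

# Lower scale -> weaker 3D guidance and more reliance on the model's own priors (more dynamism);
# higher scale -> frames follow the coarse rendering more closely.
frame = pipe(
    "a photo of the animated character",
    image=coarse_render_depth,
    controlnet_conditioning_scale=0.5,   # e.g. 1.0 for strict guidance, 0.3-0.6 for looser
    num_inference_steps=30,
).images[0]
```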
Background reconstruction facilitates large camera movement in video generation, as demonstrated in the examples above.