FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control

Zhiyuan Zhang1, Can Wang2, Dongdong Chen3, Jing Liao1
1City University of Hong Kong,
2The University of Hong Kong, 3Microsoft GenAI

We present FlexTraj, a multi-granularity, alignment-agnostic, trajectory-controlled video generation model
that supports all of the following tasks.

Abstract

We present FlexTraj, a framework for image-to-video generation with flexible point trajectory control. FlexTraj introduces a unified point-based motion representation that encodes each point with a segmentation ID, a temporally consistent trajectory ID, and an optional color channel for appearance cues, enabling both dense and sparse trajectory control. Instead of injecting trajectory conditions into the video generator through token concatenation or ControlNet, FlexTraj employs an efficient sequence-concatenation scheme that achieves faster convergence, stronger controllability, and more efficient inference, while remaining robust under unaligned conditions. To train such a unified point-trajectory-controlled video generator, FlexTraj adopts an annealing training strategy that gradually reduces reliance on complete supervision and aligned conditions. Experimental results demonstrate that FlexTraj enables multi-granularity, alignment-agnostic trajectory control for video generation, supporting applications such as motion cloning, drag-based image-to-video, motion interpolation, camera redirection, flexible action control, and mesh animation.
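To make the point-based representation in the abstract more concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the authors' implementation: the `PointTracks` container, its field names and array layout, the `given` mask, and the helpers `toy_dense_tracks`, `spatially_sparse`, and `temporally_sparse` are all hypothetical. The sketch only shows how one set of tracked points, each carrying a segmentation ID, a trajectory ID, and an optional color, could serve as a dense, a spatially sparse, or a temporally sparse condition.

```python
from dataclasses import dataclass, replace
from typing import Optional, Sequence

import numpy as np


@dataclass
class PointTracks:
    """Hypothetical container for a point-trajectory condition."""
    xy: np.ndarray                      # (T, N, 2) point positions per frame
    seg_id: np.ndarray                  # (N,) segmentation ID of each point
    traj_id: np.ndarray                 # (N,) trajectory ID, fixed over time
    given: np.ndarray                   # (T, N) bool, True where the condition is provided
    color: Optional[np.ndarray] = None  # (N, 3) optional appearance cue


def toy_dense_tracks(num_frames: int = 4, grid: int = 3) -> PointTracks:
    """Build a toy dense condition: a grid of points moving 1 unit right per frame."""
    ys, xs = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")
    start = np.stack([xs.ravel(), ys.ravel()], axis=-1).astype(np.float32)  # (N, 2)
    step = np.array([1.0, 0.0], dtype=np.float32)
    shift = np.arange(num_frames, dtype=np.float32)[:, None, None] * step   # (T, 1, 2)
    n = start.shape[0]
    return PointTracks(
        xy=start[None] + shift,
        seg_id=np.zeros(n, dtype=np.int64),   # one segment in this toy scene
        traj_id=np.arange(n, dtype=np.int64),
        given=np.ones((num_frames, n), dtype=bool),
    )


def spatially_sparse(tracks: PointTracks, point_idx: Sequence[int]) -> PointTracks:
    """Keep only the selected points (for example, a few dragged points)."""
    idx = np.asarray(point_idx)
    return PointTracks(
        xy=tracks.xy[:, idx],
        seg_id=tracks.seg_id[idx],
        traj_id=tracks.traj_id[idx],
        given=tracks.given[:, idx],
        color=None if tracks.color is None else tracks.color[idx],
    )


def temporally_sparse(tracks: PointTracks, frame_idx: Sequence[int]) -> PointTracks:
    """Provide the condition only on the selected frames (for example, keyframes)."""
    idx = np.asarray(frame_idx)
    given = np.zeros_like(tracks.given)
    given[idx] = tracks.given[idx]
    return replace(tracks, given=given)


dense = toy_dense_tracks(num_frames=4, grid=3)
drag = spatially_sparse(dense, [0, 4])       # two controlled points
keyframes = temporally_sparse(dense, [0, 3])  # first and last frame only
```

In this sketch the trajectory IDs survive subsampling unchanged, so a sparse condition still identifies which point each track belongs to across frames.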

Method

Results

Motion Clone

Task label: Dense

Meshes to Video

Task label: Dense

Camera Redirect

Task label: Dense

Drag to Video (Single Point)

Task label: Spatially Sparse

Drag to Video (Occlusion)

Task label: Spatially Sparse

Partial Meshes to Video

Task label: Spatially Sparse

Sparse Points to Video

Task label: Spatially Sparse

Motion Interpolation

Task label: Temporally Sparse

Flexible Action Control (Real-world Video)

Task label: Unaligned

Flexible Action Control (Synthetic Video)

Task label: Unaligned

Coarse Meshes to Video

Task label: Unaligned
