FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control

Zhiyuan Zhang¹, Can Wang², Dongdong Chen³, Jing Liao¹

¹City University of Hong Kong,
²The University of Hong Kong, ³Microsoft GenAI


We present FlexTraj, a multi-granularity, alignment-agnostic, trajectory-controlled video generation model
that supports all of the tasks shown below.

Abstract

We present FlexTraj, a framework for image-to-video generation with flexible point trajectory control. FlexTraj introduces a unified point-based motion representation that encodes each point with a segmentation ID, a temporally consistent trajectory ID, and an optional color channel for appearance cues, enabling both dense and sparse trajectory control. Instead of injecting trajectory conditions into the video generator through token concatenation or ControlNet, FlexTraj employs an efficient sequence-concatenation scheme that achieves faster convergence, stronger controllability, and more efficient inference, while remaining robust under unaligned conditions. To train such a unified point-trajectory-controlled video generator, FlexTraj adopts an annealing training strategy that gradually reduces reliance on complete supervision and aligned conditions. Experimental results demonstrate that FlexTraj enables multi-granularity, alignment-agnostic trajectory control for video generation, supporting applications such as motion cloning, drag-based image-to-video, motion interpolation, camera redirection, flexible action control, and mesh animation.

Method


Overview of the FlexTraj framework. Given 3D-tracking points annotated with TrackID, SegID, and optional Color, users can sparsify or shift trajectories to define spatially sparse, temporally sparse, or unaligned controls. These modified trajectories are projected into condition videos (ID-coded and color-cue) and combined with the first frame and text prompt as inputs to a video diffusion model via efficient sequence-concatenation.
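The trajectory editing described in this caption can be illustrated with a small sketch. It is not the authors' implementation: the `TrackPoint` record, the helper names, and the assumption that points have already been projected to integer pixel coordinates are all hypothetical, chosen only to show how spatially sparse, temporally sparse, and unaligned (shifted) controls might be derived from one dense set of tracks and written into an ID-coded condition video.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class TrackPoint:
    """One tracked point in one frame (hypothetical layout, not from the paper)."""
    track_id: int   # temporally consistent trajectory ID
    seg_id: int     # segmentation ID
    frame: int      # frame index
    x: int          # pixel column (assumed already projected from 3D)
    y: int          # pixel row
    color: Optional[Tuple[int, int, int]] = None  # optional appearance cue


def spatially_sparsify(points, keep_track_ids):
    """Keep only the selected trajectories (spatially sparse control)."""
    keep = set(keep_track_ids)
    return [p for p in points if p.track_id in keep]


def temporally_sparsify(points, keep_frames):
    """Keep only the selected frames (temporally sparse control)."""
    keep = set(keep_frames)
    return [p for p in points if p.frame in keep]


def shift(points, dx, dy):
    """Translate every point, giving a control unaligned with the first frame."""
    return [replace(p, x=p.x + dx, y=p.y + dy) for p in points]


def rasterize_ids(points, num_frames, height, width):
    """Build video[frame][y][x] = (track_id, seg_id), or None where no point lands."""
    video = [[[None] * width for _ in range(height)] for _ in range(num_frames)]
    for p in points:
        inside = 0 <= p.frame < num_frames and 0 <= p.y < height and 0 <= p.x < width
        if inside:
            video[p.frame][p.y][p.x] = (p.track_id, p.seg_id)
    return video


# Two toy tracks moving right by one pixel per frame, over three frames.
points = [
    TrackPoint(track_id=t, seg_id=1, frame=f, x=t + f, y=0)
    for t in (0, 1)
    for f in range(3)
]

# Keep track 0 only, at the first and last frames, then shift it by (1, 1).
sparse = temporally_sparsify(spatially_sparsify(points, [0]), [0, 2])
moved = shift(sparse, dx=1, dy=1)
cond = rasterize_ids(moved, num_frames=3, height=2, width=4)
```

In this toy example the middle frame of `cond` is empty, which is the kind of gap a temporally sparse control leaves for the generator to fill. A color-cue condition video could be built the same way by writing `p.color` instead of the ID pair.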


Comparison of condition-injection frameworks. (a) ControlNet-Style condition injection. (b) Sequence-Concatenation condition injection. (c) Our Efficient Sequence-Concatenation with LoRA and masked attention. (d) Causal mask.
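To make the masked-attention idea in panels (c) and (d) concrete, here is a minimal sketch of one possible block-wise mask for a concatenated token sequence. The caption does not spell out the mask, so the layout `[video tokens | condition tokens]` and the rule used here (video queries attend to every token, condition queries attend only to condition tokens) are assumptions made for illustration; `build_block_mask` and `masked_softmax` are hypothetical helper names.

```python
import math


def build_block_mask(num_video_tokens, num_cond_tokens):
    """Return mask[q][k], True when query token q may attend to key token k.

    Assumed layout: [video tokens | condition tokens].
    Assumed rule: condition queries see only condition keys,
    video queries see all keys.
    """
    n = num_video_tokens + num_cond_tokens
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        q_is_cond = q >= num_video_tokens
        for k in range(n):
            k_is_cond = k >= num_video_tokens
            mask[q][k] = k_is_cond or not q_is_cond
    return mask


def masked_softmax(scores, allowed):
    """Softmax over one row of attention scores, ignoring disallowed keys."""
    masked = [s if a else -math.inf for s, a in zip(scores, allowed)]
    peak = max(masked)  # finite, because every token may attend to itself
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


mask = build_block_mask(num_video_tokens=2, num_cond_tokens=2)

# Attention weights for the first condition token with uniform raw scores:
# all weight falls on the two condition tokens.
cond_weights = masked_softmax([0.0, 0.0, 0.0, 0.0], mask[2])
```

Under this assumed rule the condition tokens never depend on the noisy video tokens, so their features would not change across denoising steps and could in principle be computed once and reused. Whether FlexTraj exploits the mask in exactly this way is not stated in the caption.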

Results

Motion Clone
Task label: Dense

Meshes to Video
Task label: Dense

Camera Redirect
Task label: Dense

Drag to Video (Single Point)
Task label: Spatially Sparse

Drag to Video (Occlusion)
Task label: Spatially Sparse

Partial Meshes to Video
Task label: Spatially Sparse

Sparse Points to Video
Task label: Spatially Sparse

Motion Interpolation
Task label: Temporally Sparse

Flexible Action Control (Real-world Video)
Task label: Unaligned

Flexible Action Control (Synthetic Video)
Task label: Unaligned

Coarse Meshes to Video
Task label: Unaligned
