Continuous Layout Editing of Single Images with Diffusion Models

Zhiyuan Zhang\(^{1*}\), Zhitong Huang\(^{1*}\), Jing Liao\(^{1 \dagger}\)

\(^1\): City University of Hong Kong, Hong Kong SAR, China
\(^*\): Both authors contributed equally to this research
\(^\dagger\): Corresponding author


Abstract

Recent advancements in large-scale text-to-image diffusion models have enabled many applications in image editing. However, none of these methods can edit the layout of a single existing image. To address this gap, we propose the first framework for layout editing of a single image while preserving its visual properties, thus allowing for continuous editing on a single image. Our approach relies on two key modules. First, to preserve the characteristics of multiple objects within an image, we disentangle the concepts of different objects and embed them into separate textual tokens using a novel method called masked textual inversion. Next, we propose a training-free optimization method to perform layout control for a pre-trained diffusion model, which allows us to regenerate images with the learned concepts and align them with user-specified layouts. As the first framework to edit the layout of existing images, we demonstrate that our method is effective and outperforms other baselines that were modified to support this task. Our code will be freely available for public use upon acceptance.

Methods

Learn Concepts of Multiple Objects within Single Image

Our method is divided into two stages. In the first stage, we learn the concepts of multiple objects from a single input image \(I\) into text tokens \(v_{1}, v_{2}, \ldots, v_{N}\) with masked textual inversion, where the region of each object is specified by a mask \(M_{1}, M_{2}, \ldots, M_{N}\). We further learn the details of the objects by finetuning the diffusion model \(\epsilon_{\theta}\) and optimizing the appended tokens \(v_{[1]}, \ldots, v_{[L]}\).
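
To make the role of the masks concrete, here is a minimal sketch of a masked reconstruction loss, assuming that masked textual inversion restricts the usual noise-prediction error to the pixels covered by one object's mask. The page does not state the exact loss, so the function name `masked_mse`, the per-pixel mean, and the tiny 2x2 grids are illustrative assumptions, not the authors' implementation.

```python
from typing import List

Grid = List[List[float]]


def masked_mse(pred: Grid, target: Grid, mask: Grid) -> float:
    """Mean squared error over the pixels where `mask` is nonzero.

    `pred` and `target` stand in for predicted and true noise, and `mask`
    for one object mask M_i (hypothetical names, for illustration only).
    """
    weighted_error = 0.0
    mask_area = 0.0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            weighted_error += m * (p - t) ** 2
            mask_area += m
    # An empty mask contributes no loss instead of dividing by zero.
    return weighted_error / mask_area if mask_area > 0 else 0.0


pred = [[1.0, 5.0], [0.0, 2.0]]
target = [[0.0, 0.0], [0.0, 0.0]]
mask = [[1.0, 0.0], [0.0, 1.0]]  # only the diagonal pixels belong to the object
print(masked_mse(pred, target, mask))  # 2.5
```

Under this assumption, the large error at the unmasked pixel (value 5.0) does not influence the token for this object, which is the intuition behind learning each token from its own region.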

Training-Free Layout Editing

After the first stage, we obtain the optimized text tokens for the objects \(y_{*} = [v_{1*}, \ldots, v_{N*}]\) and the finetuned model \(\epsilon_{\theta*}\). In the second stage, we rearrange the positions of the objects according to the user-specified layout map \(I_{L}\) through a training-free layout editing method with iterative optimization.
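
The idea of training-free, iterative optimization can be illustrated with a toy example. The sketch below assumes, purely for illustration, that layout control works by taking gradient steps on a latent so that an attention-like map concentrates inside the target region, while the model weights stay fixed. The page does not specify the loss or the update rule, so the one-dimensional latent, the softmax "attention", the loss \(1 - \text{mass inside the region}\), and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a 1-D list."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def mass_in_region(z: Sequence[float], region: Sequence[int]) -> float:
    """Fraction of the softmax 'attention' that falls inside the target region."""
    return sum(a for a, r in zip(softmax(z), region) if r)


def optimize_latent(z: Sequence[float], region: Sequence[int],
                    steps: int = 100, lr: float = 0.5) -> List[float]:
    """Gradient descent on loss = 1 - mass_in_region, updating only the latent.

    For a softmax map a, d(loss)/d(z_j) = a_j * (s - r_j), where s is the mass
    inside the region and r_j is 1 inside the region and 0 outside.
    """
    z = list(z)
    for _ in range(steps):
        a = softmax(z)
        s = sum(ai for ai, r in zip(a, region) if r)
        z = [zj - lr * aj * (s - (1.0 if r else 0.0))
             for zj, aj, r in zip(z, a, region)]
    return z


latent = [0.0, 0.0, 0.0, 0.0]   # toy latent: attention starts out uniform
target_region = [0, 1, 0, 0]    # toy layout: the object should sit at position 1
edited = optimize_latent(latent, target_region)
print(mass_in_region(latent, target_region), mass_in_region(edited, target_region))
```

No parameters are trained here: only the latent changes, which is what "training-free" refers to in this sketch. In an actual diffusion pipeline the map would come from the finetuned model \(\epsilon_{\theta*}\) and the steps would be applied at selected denoising steps.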

Comparison and Application

Comparison

Qualitative comparison with other baseline methods. From left to right: (a) Input images; (b) Target layout; (c) Image-level manipulation; (d) Latent-level manipulation; (e) GLIGEN + textual inversion; (f) MultiDiffusion + Dreambooth; (g) Ours.

Application

Continuous layout editing with different target layouts

Ablation Study

Ablation study on different textual inversion methods. From left to right: (a) Input images; (b) Target layouts; (c) Inversion with Dreambooth; (d) Textual inversion + finetune; (e) Masked textual inversion w/o finetune; (f) Our full inversion method with masked textual inversion and finetune.

Ablation study on different layout control methods. From left to right: (a) Input images; (b) Target layouts; (c) Our inversion + ControlNet; (d) Our inversion + GLIGEN; (e) Our inversion + MultiDiffusion; (f) Our full method.

Ablation study on optimization loss of layout control.

Ablation study on iterative optimization. The number above each image indicates the denoising step at which iterative optimization is applied. If more than one number is shown, iterative optimization is applied at multiple denoising steps.

Ablation study on blending steps. The number above the image indicates the number of steps where blending is applied.