MoEController: Instruction-based Arbitrary Image Manipulation with Mixture-of-Expert Controllers

Authors: Sijia Li, Chen Chen, Haonan Lu

What

This paper introduces MoEController, a novel method for arbitrary image manipulation guided by text instructions, tackling the challenge of performing both global and local image editing in a unified framework.

Why

This paper is important as it addresses the limitations of existing image manipulation methods that struggle to effectively handle both global and local edits based on open-domain text instructions. It proposes a novel approach using a mixture-of-expert (MOE) framework to enhance the model’s adaptability to diverse image manipulation tasks.

How

The authors first create a large-scale dataset for global image manipulation using ChatGPT to generate target captions and ControlNet to generate image pairs. They then design an MOE model with a fusion module, multiple expert models, and a gate system to discriminate between different instruction semantics and adapt to specific tasks. The model is trained with a reconstruction loss to ensure image entity consistency.
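The summary does not include the authors' code, but the gist of the mixture-of-expert routing can be illustrated with a small PyTorch sketch. Everything below is an assumption for illustration only: the embedding size, the number and granularity of experts, and the soft (rather than hard top-1) gating are not taken from the paper.

```python
# Minimal sketch (not the authors' implementation) of an MOE controller that
# routes an instruction embedding to task-specific experts and fuses them.
import torch
import torch.nn as nn


class MoEControllerSketch(nn.Module):
    def __init__(self, dim: int = 768, num_experts: int = 3):
        super().__init__()
        # One expert per hypothetical manipulation family
        # (e.g. global style transfer, local edit, object manipulation).
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)]
        )
        # Gate scores each instruction embedding and weights the experts.
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, instruction_emb: torch.Tensor) -> torch.Tensor:
        # instruction_emb: (batch, dim) pooled text embedding of the instruction
        weights = torch.softmax(self.gate(instruction_emb), dim=-1)  # (batch, E)
        expert_out = torch.stack(
            [expert(instruction_emb) for expert in self.experts], dim=1
        )  # (batch, E, dim)
        # Weighted fusion of the experts; the fused feature would then
        # condition the diffusion model (e.g. via cross-attention).
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)


if __name__ == "__main__":
    controller = MoEControllerSketch()
    fake_instruction = torch.randn(2, 768)
    print(controller(fake_instruction).shape)  # torch.Size([2, 768])
```

The key design point this sketch tries to capture is that the gate, not a hand-written rule, decides how much each task-specific expert contributes for a given instruction.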

Result

MoEController demonstrates superior performance in both global and local image manipulation tasks compared to existing methods. It effectively handles complex style transfers, local edits, and object manipulations. Quantitative evaluations using CLIP metrics and user studies confirm its effectiveness and adaptability to open-domain instructions.
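The summary does not spell out which CLIP metrics are used; a common choice for instruction-based editing is CLIP text-image similarity between the edited image and the target caption. A minimal sketch with the Hugging Face `transformers` CLIP API is shown below; the checkpoint name and this particular metric choice are assumptions, not details from the paper.

```python
# Hedged sketch: CLIP text-image similarity as an editing-quality proxy.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_similarity(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())

# Usage: clip_similarity(edited_image, "a watercolor painting of a dog")
```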

Limitations & Future Work

The authors suggest extending MoEController to handle a wider range of human instructions and more complex image manipulation tasks in the future. Deeper exploration of expert model design and optimization of the gating mechanism could improve performance further.

Abstract

Diffusion-model-based text-guided image generation has recently made astounding progress, producing fascinating results in open-domain image manipulation tasks. Few models, however, currently have complete zero-shot capabilities for both global and local image editing due to the complexity and diversity of image manipulation tasks. In this work, we propose a method with mixture-of-expert (MOE) controllers to align the text-guided capacity of diffusion models with different kinds of human instructions, enabling our model to handle various open-domain image manipulation tasks with natural language instructions. First, we use large language models (ChatGPT) and conditional image synthesis models (ControlNet) to generate a large-scale global image transfer dataset in addition to the instruction-based local image editing dataset. Then, using an MOE technique and task-specific adaptation training on a large-scale dataset, our conditional diffusion model can edit images globally and locally. Extensive experiments demonstrate that our approach performs surprisingly well on various image manipulation tasks when dealing with open-domain images and arbitrary human instructions. Please refer to our project page: [https://oppo-mente-lab.github.io/moe_controller/]
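As a rough illustration of the data-construction step described above, one plausible pipeline is: an LLM rewrites a source caption into a target caption (e.g. the same scene in a different style), and ControlNet renders the target caption while a structure condition keeps the layout of the source image, yielding a (source, target, instruction) triple. The sketch below uses the `diffusers` ControlNet API; the checkpoints, Canny conditioning, and prompt template are assumptions, not the paper's exact pipeline.

```python
# Hedged sketch of synthesizing a global-style-transfer training pair.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")


def make_target(source: Image.Image, target_caption: str) -> Image.Image:
    """Render the target caption while preserving the source image's structure."""
    edges = cv2.Canny(np.array(source), 100, 200)            # structure condition
    edges = Image.fromarray(np.stack([edges] * 3, axis=-1))   # 3-channel control image
    return pipe(target_caption, image=edges, num_inference_steps=30).images[0]

# The target caption could come from an LLM prompt such as:
# "Rewrite this caption so the scene is rendered in watercolor style: <caption>"
```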