ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation

Tel Aviv University, NVIDIA

TL;DR: We predict a ComfyUI workflow that matches a user's text-to-image prompt. Generating images with these prompt-specific flows improves quality.

Teaser.

The text-to-image user community has largely moved from monolithic models to complex workflows that combine fine-tuned base models, LoRAs and embeddings, super-resolution steps, prompt refiners, and more.

Building effective workflows requires significant expertise because of the large number of available components, their complex interdependence, and their dependence on the generation domain.

We introduce the novel task of prompt-adaptive workflow generation, where the goal is to learn how to automate this process and tailor an effective workflow to each user prompt. We propose two LLM baselines to tackle this task, and show that they offer a new path to improving image generation performance.

ComfyGen can produce high quality results and generalize to diverse domains. All images were created with SDXL-scale models (no FLUX!)

Abstract

The practical use of text-to-image generation has evolved from simple, monolithic models to complex workflows that combine multiple specialized components. While workflow-based approaches can lead to improved image quality, crafting effective workflows requires significant expertise, owing to the large number of available components, their complex inter-dependence, and their dependence on the generation prompt.

Here, we introduce the novel task of prompt-adaptive workflow generation, where the goal is to automatically tailor a workflow to each user prompt.

We propose two LLM-based approaches to tackle this task: a tuning-based method that learns from user-preference data, and a training-free method that uses the LLM to select existing flows. Both approaches lead to improved image quality when compared to monolithic models or generic, prompt-independent workflows. Our work shows that prompt-dependent flow prediction offers a new pathway to improving text-to-image generation quality, complementing existing research directions in the field.

How does it work?

We base our work around ComfyUI, an open-source tool for designing and executing text-to-image pipelines. These pipelines are represented as JSON, a natural format for an LLM to predict.
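For illustration, here is a minimal Python sketch of the kind of JSON graph ComfyUI executes. The node IDs, class names, and checkpoint filename are placeholders following ComfyUI's API conventions, not a flow from our dataset.

```python
import json

# A minimal, illustrative text-to-image graph in ComfyUI's API-style JSON:
# each key is a node ID, and each node names its type and wires its inputs
# to (node_id, output_index) pairs. Real flows in our set are far larger.
workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "sdxl_base_1.0.safetensors"}},  # placeholder file name
    "2": {"class_type": "CLIPTextEncode",
          "inputs": {"text": "a watercolor fox in a snowy forest", "clip": ["1", 1]}},
    "3": {"class_type": "CLIPTextEncode",
          "inputs": {"text": "blurry, low quality", "clip": ["1", 1]}},
    "4": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 1024, "height": 1024, "batch_size": 1}},
    "5": {"class_type": "KSampler",
          "inputs": {"model": ["1", 0], "positive": ["2", 0], "negative": ["3", 0],
                     "latent_image": ["4", 0], "seed": 42, "steps": 30, "cfg": 7.0,
                     "sampler_name": "euler", "scheduler": "normal", "denoise": 1.0}},
    "6": {"class_type": "VAEDecode",
          "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
    "7": {"class_type": "SaveImage",
          "inputs": {"images": ["6", 0], "filename_prefix": "comfygen"}},
}

print(json.dumps(workflow, indent=2))  # exactly the kind of string an LLM can emit
```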

To teach the LLM which flows are a good match for a prompt, we collect a set of human-created ComfyUI workflows and augment them by randomly swapping parameters such as the base model, the LoRAs, the sampler, the number of steps, and the guidance scale.
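A rough sketch of this augmentation step is shown below, reusing the workflow dict from the previous snippet. The candidate pools and node class names are illustrative, not our exact lists.

```python
import copy
import random

# Illustrative pools of swappable components; the real augmentation draws from
# the models, LoRAs, and samplers that appear in the collected human-made flows.
BASE_MODELS = ["sdxl_base_1.0.safetensors", "juggernautXL_v9.safetensors"]
LORAS = ["add-detail-xl.safetensors", "watercolor_style_xl.safetensors"]
SAMPLERS = ["euler", "dpmpp_2m", "dpmpp_sde"]

def augment(workflow: dict, rng: random.Random) -> dict:
    """Return a copy of a ComfyUI workflow with randomly swapped parameters."""
    flow = copy.deepcopy(workflow)
    for node in flow.values():
        inputs = node.get("inputs", {})
        if node["class_type"] == "CheckpointLoaderSimple":
            inputs["ckpt_name"] = rng.choice(BASE_MODELS)
        elif node["class_type"] == "LoraLoader":
            inputs["lora_name"] = rng.choice(LORAS)
            inputs["strength_model"] = round(rng.uniform(0.4, 1.0), 2)
        elif node["class_type"] == "KSampler":
            inputs["sampler_name"] = rng.choice(SAMPLERS)
            inputs["steps"] = rng.choice([20, 30, 40, 50])
            inputs["cfg"] = rng.choice([4.0, 5.5, 7.0, 8.5])
    return flow

rng = random.Random(0)
augmented_flows = [augment(workflow, rng) for _ in range(10)]
```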

We further collect a set of 500 prompts and use them to generate images with each flow in our set. Then, we score these images using an ensemble of aesthetic and human preference predictors. This gives us a set of (prompt, flow, score) triplets.
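Schematically, the data collection loop looks like the following. The run_comfy callable and the individual scorers are hypothetical stand-ins, and the simple averaging here is only one way to combine an ensemble of aesthetic and human-preference predictors.

```python
def ensemble_score(image, prompt, scorers) -> float:
    """Combine several aesthetic / preference predictors (here: a plain average)."""
    return sum(score_fn(image, prompt) for score_fn in scorers) / len(scorers)

def collect_triplets(prompts, flows, run_comfy, scorers):
    """Build the (prompt, flow, score) triplets used to supervise the LLM."""
    triplets = []
    for prompt in prompts:                    # the 500 collected prompts
        for flow in flows:                    # every human-made or augmented flow
            image = run_comfy(flow, prompt)   # execute the workflow via ComfyUI
            score = ensemble_score(image, prompt, scorers)
            triplets.append({"prompt": prompt, "flow": flow, "score": score})
    return triplets
```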

We then explore two approaches. The first is an in-context approach, where we give the LLM a table of flows and their scores across prompt categories and ask it to pick the one that best matches a new prompt. The second is a fine-tuning approach, where we provide the LLM with the input prompt and a score, and ask it to predict the flow that achieved this score. At inference time, we simply provide the LLM with a prompt and a high target score, and ask it to predict a flow that matches them.
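As a rough sketch of how the fine-tuned variant can be queried at inference time: the prompt template, the target score value, and the llm_generate interface below are illustrative assumptions, not the exact setup used in the paper.

```python
import json

TARGET_SCORE = 0.95  # a high value on the ensemble's score scale (illustrative)

def predict_flow(user_prompt: str, llm_generate) -> dict:
    """Ask the fine-tuned LLM for a workflow that should reach a high score."""
    request = (
        f"Prompt: {user_prompt}\n"
        f"Target score: {TARGET_SCORE}\n"
        "Output the ComfyUI workflow JSON that achieves this score:"
    )
    raw = llm_generate(request)   # any text-completion interface
    return json.loads(raw)        # the predicted workflow, ready to execute

# Example usage (with run_comfy as in the previous sketch):
# flow = predict_flow("a watercolor fox in a snowy forest", my_llm)
# image = run_comfy(flow, "a watercolor fox in a snowy forest")
```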

Comparisons

We compared our model to two classes of baselines: monolithic models (SDXL, its most popular fine-tuned versions, and a DPO-optimized baseline) and fixed, prompt-independent flows. Our approach outperforms them all on both human-preference metrics and prompt-alignment benchmarks.

Comparisons on user-created prompts from CivitAI



User study results on user-created prompts from CivitAI



Comparisons on prompts from the GenEval benchmark



GenEval benchmark results

BibTeX

If you find our work useful, please cite our paper:

@misc{gal2024comfygenpromptadaptiveworkflowstexttoimage,
      title={ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation}, 
      author={Rinon Gal and Adi Haviv and Yuval Alaluf and Amit H. Bermano and Daniel Cohen-Or and Gal Chechik},
      year={2024},
      eprint={2410.01731},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.01731}, 
}