Recently, I’ve been generating many images to serve as illustrations in my Spanish vocabulary courses published on Flashcard Space.
Last week I switched from Stable Diffusion XL to Flux.1 dev. It is a significant upgrade in image quality, but I also noticed the illustrations have less variety than SDXL produced, at least when I use the same simple prompts as before. When I worked on a flashcard set teaching the names of professions in Spanish, it felt like most illustrations depicted an average-looking, middle-aged, bearded Caucasian male as the representative of every profession, so I looked for a way to add variance to my image outputs 🙂
Idea: Using AI to refine prompts for the image generation model
People keep discovering interesting prompt-engineering tricks to improve the outputs of every image generation model. And since we already have powerful and cheap general-purpose LLMs like gpt-4o, we can codify the knowledge of what makes a good prompt tailored to Flux.1 and let AI preprocess our basic prompt into a fancier, refined version before we pass it to Flux.
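To make the idea concrete, here is a minimal sketch of such a pipeline, assuming FLUX.1-dev runs locally via the diffusers library and an OpenAI API key is available. The REFINER_GUIDELINES string is my own illustrative guess at what "prompt guidelines" could look like, not the instructions used by the generators linked later in this post.

```python
import torch
from openai import OpenAI
from diffusers import FluxPipeline

# My own example of guidelines for the refiner LLM (not from any real tool).
REFINER_GUIDELINES = (
    "You rewrite short flashcard sentences into detailed prompts for the "
    "Flux.1 image model. Describe a concrete scene: subject, setting, "
    "lighting, composition and style. Vary the age, gender and ethnicity of "
    "people across prompts. Return only the rewritten prompt."
)

def refine_prompt(base_prompt: str) -> str:
    """Ask gpt-4o to expand a basic prompt into a Flux-friendly one."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": REFINER_GUIDELINES},
            {"role": "user", "content": base_prompt},
        ],
    )
    return response.choices[0].message.content.strip()

def generate_image(prompt: str):
    """Generate one image with FLUX.1-dev via diffusers."""
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
    ).to("cuda")
    result = pipe(
        prompt,
        height=1024,
        width=1024,
        guidance_scale=3.5,
        num_inference_steps=50,
    )
    return result.images[0]

if __name__ == "__main__":
    base = "The road is full of leaves"
    refined = refine_prompt(base)
    print(refined)
    generate_image(refined).save("road_refined.png")
```

In my experiments I used ready-made prompt generators instead of my own guidelines, but the flow is the same: the base prompt goes through the LLM once, and only the refined text reaches Flux.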
My experiments will show images generated using three prompts:
- Base prompt, something like
The road is full of leaves
- AI-preprocessed base prompt created with the prompt generator at https://flux1.ai/prompt-generator – much more verbose and elaborate
- AI-preprocessed base prompt created with https://shop.xerophayze.com/xerogenlite – a different prompt generator where the author put significant work into crafting good LLM prompts
Experiment #1: refining a concrete, non-abstract prompt
Base prompt: The road is full of leaves
Comment: The aesthetics differ, but there isn’t much difference in the image’s composition or idea.
Experiment #2: refining a more abstract prompt
The area where I hope to see the most significant benefit of AI-refining prompts is when they are vague and relate to abstract concepts rather than describing something visual. So as an example, let’s generate an illustration for a flashcard with the following vague sentence:
Base prompt: She is able to solve the problem.
Comment: The results differ in aesthetics, but Flux.1 had the same idea about depicting “solving a problem.” Let me run one more experiment to see whether it can handle even vaguer descriptions, this time with idioms.
Experiment #3: an ultra-abstract prompt with idiomatic expressions
Base prompt: He tried to kill two birds with one stone, but ended up biting off more than he could chew.
Comment: Finally, with a highly vague prompt and two idiomatic expressions, the Flux model gave up and produced useless output. The refined prompts let me create pictures with reasonable aesthetics, but they didn’t get far beyond the literal meaning of the idioms.
Discussion
Based on the above, I don’t have a definite conclusion on whether it’s worth adding the complexity of preprocessing and refining prompts with an LLM such as gpt-4o before passing them to Flux.1.
It seems that Flux.1 has quite a good understanding of natural language, and as long as we avoid idioms and abstract ideas, it can handle some ambiguity in prompts. Of course, the value added by preprocessing the prompt will depend on the guidelines we provide to the LLM. If we are determined to invest time in it, we can surely guide the model to understand idioms, improve the aesthetics, or enforce a concrete image style; a sketch of what such guidelines could look like follows below.
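If I were to go that route, the guidelines might look something like the snippet below. This is entirely my own wording, not taken from either of the prompt generators I tested; it would simply replace the REFINER_GUIDELINES constant in the earlier sketch.

```python
# Hypothetical guidelines asking the LLM to unpack idioms into a literal
# visual scene and to enforce a consistent flashcard illustration style.
IDIOM_AWARE_GUIDELINES = (
    "You rewrite sentences into prompts for the Flux.1 image model. "
    "If the sentence contains an idiom, first restate its figurative meaning, "
    "then describe a single concrete scene that conveys that meaning, never "
    "the literal words of the idiom. Always use the same style: flat vector "
    "illustration, soft pastel colors, clean white background. "
    "Return only the final prompt."
)
```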
If you have some experience with pre-processing prompts before sending them to Flux.1 and would like to share your conclusions with other readers, please leave a comment!