Diffusion models enable high-quality and diverse visual content synthesis.
However, they struggle to generate rare or unseen concepts.
To address this challenge, we explore the use of Retrieval-Augmented Generation (RAG) with image generation models.
We propose ImageRAG, a method that dynamically retrieves relevant images based on a given text prompt, and uses them as context to guide the generation process.
Prior approaches that used retrieved images to improve generation trained models specifically for retrieval-based generation.
In contrast, ImageRAG leverages the capabilities of existing image-conditioning models and does not require RAG-specific training.
Our approach is highly adaptable, can be applied across different model types, and shows significant improvement in generating rare and fine-grained concepts with different base models.
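As an illustration of the retrieval step, the following sketch ranks images from an external database by CLIP similarity to a text query. The CLIP checkpoint ("openai/clip-vit-base-patch32") and the Hugging Face transformers interface are assumptions made for this example; the paper's actual retrieval implementation may differ.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_images(paths):
    # Encode the database images into normalized CLIP embeddings (done once, offline).
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def retrieve(caption, db_paths, db_feats, k=3):
    # Rank database images by cosine similarity to the retrieval caption and return the top-k paths.
    inputs = processor(text=[caption], return_tensors="pt", padding=True)
    text_feat = model.get_text_features(**inputs)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    sims = (db_feats @ text_feat.T).squeeze(-1)
    top = sims.topk(k).indices.tolist()
    return [db_paths[i] for i in top]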
Top part: given a text prompt p, we generate an initial image using a text-to-image (T2I) model. Then, we generate retrieval captions c_j, retrieve images i_j from an external database for each caption, and use them as references for the model to improve the generation.
Bottom part: the retrieval-caption generation block. We use a VLM to decide whether the initial image matches the given prompt. If it does not, we ask the VLM to list the missing concepts and to create a caption that can be used to retrieve appropriate examples for each missing concept.
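A minimal sketch of this loop is shown below. It assumes hypothetical helper functions t2i_generate (a base text-to-image model), vlm_answer (a vision-language model query), and generate_with_references (an image-conditioned generator such as OmniGen), plus the retrieve helper from the earlier retrieval sketch; these names are placeholders, not the paper's API.

def image_rag(prompt, db_paths, db_feats, max_refs=3):
    # Step 1: initial attempt with the base T2I model.
    image = t2i_generate(prompt)  # hypothetical helper

    # Step 2: ask the VLM whether the initial image already matches the prompt.
    answer = vlm_answer(image, f"Does this image match the prompt '{prompt}'? Answer yes or no.")
    if answer.strip().lower().startswith("yes"):
        return image

    # Step 3: ask the VLM for the missing concepts and one retrieval caption per concept.
    missing = vlm_answer(image, f"List the concepts from '{prompt}' that are missing in this image.")
    captions = vlm_answer(image, "Write one short image-search caption per missing concept, "
                                 f"one per line: {missing}").splitlines()

    # Step 4: retrieve a reference image per caption and regenerate with them as context.
    references = []
    for caption in captions[:max_refs]:
        references += retrieve(caption, db_paths, db_feats, k=1)
    return generate_with_references(prompt, references)  # hypothetical image-conditioned generator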
Examples of rare-concept generation, comparing SDXL and OmniGen with and without our method.
The left-most image column shows the reference retrieved by ImageRAG for each prompt.
OmniGen and SDXL both struggle with the uncommon concepts,
sometimes generating completely unrelated images.
However, when using ImageRAG, both models generate the correct concepts.
Comparisons on fine-grained image generation with text-to-image models. We use the ImageNet (Deng et al., 2009), iNaturalist (Van Horn et al., 2018), CUB (Wah et al., 2011), and Aircraft (Maji et al., 2013) datasets. For each dataset, we report the mean (± standard error) CLIP and SigLIP text-to-image similarities, and the DINO similarity between real and generated images. The middle rows feature OmniGen-based models, while the bottom rows feature SDXL-based models. In each part, the best results are bolded.
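For reference, the reported similarity scores could be computed roughly as follows. This sketch assumes the Hugging Face transformers library, an OpenAI CLIP checkpoint, and DINOv2 features as a stand-in for the DINO backbone; the exact models and preprocessing used in the paper may differ, and the SigLIP score is computed analogously with a SigLIP checkpoint.

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
dino = AutoModel.from_pretrained("facebook/dinov2-base")  # stand-in for the DINO backbone
dino_proc = AutoImageProcessor.from_pretrained("facebook/dinov2-base")

@torch.no_grad()
def clip_text_image_similarity(prompt, image_path):
    # Cosine similarity between the prompt embedding and the generated-image embedding.
    image = Image.open(image_path).convert("RGB")
    inputs = clip_proc(text=[prompt], images=image, return_tensors="pt", padding=True)
    out = clip(**inputs)
    t = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    i = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return (t * i).sum().item()

@torch.no_grad()
def dino_image_similarity(real_path, generated_path):
    # Cosine similarity between DINO-style CLS features of a real and a generated image.
    images = [Image.open(p).convert("RGB") for p in (real_path, generated_path)]
    inputs = dino_proc(images=images, return_tensors="pt")
    cls = dino(**inputs).last_hidden_state[:, 0]
    cls = cls / cls.norm(dim=-1, keepdim=True)
    return (cls[0] * cls[1]).sum().item()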
To further assess the quality of our results, we conduct a user study with 46 participants and a total of 767 comparisons. The results above show the percentage of users who preferred our method over each competing method in terms of text alignment, visual quality, and overall preference. As shown, participants favored our method, ImageRAG, over all other methods on all three criteria.
More experiments and ablation studies, as well as further details about each of the results above, can be found in the paper.
ImageRAG can work in parallel with personalization methods and enhance their capabilities. For example, although OmniGen can generate images of a subject based on a reference image of it, it struggles to generate some concepts. Using references retrieved by our method, it can generate the required result, such as my cat teaching a class of dogs, appearing on a mug, or built from LEGO.
If you find our work useful, please cite our paper:
@misc{shalevarkushin2025imageragdynamicimageretrieval,
title={ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation},
author={Rotem Shalev-Arkushin and Rinon Gal and Amit H. Bermano and Ohad Fried},
year={2025},
eprint={2502.09411},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.09411},
}