Diffusion models enable high-quality and diverse visual content synthesis.
However, they struggle to generate rare or unseen concepts.
To address this challenge, we explore the use of Retrieval-Augmented Generation (RAG) with image generation models.
We propose ImageRAG, a method that dynamically retrieves relevant images based on a given text prompt, and uses them as context to guide the generation process.
Prior approaches that used retrieved images to improve generation trained models specifically for retrieval-based generation.
In contrast, ImageRAG leverages the capabilities of existing image-conditioning models and does not require RAG-specific training.
Our approach is highly adaptable, can be applied across different model types, and shows significant improvement in generating rare and fine-grained concepts with different base models.
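As an illustration of the retrieval step, the following sketch ranks images from an external database by CLIP similarity to a text query. The CLIP checkpoint ("openai/clip-vit-base-patch32") and the Hugging Face transformers interface are assumptions made for this example; the paper's actual retrieval implementation may differ.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_images(paths):
    # Encode the database images into normalized CLIP embeddings (done once, offline).
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def retrieve(caption, db_paths, db_feats, k=3):
    # Rank database images by cosine similarity to the retrieval caption and return the top-k paths.
    inputs = processor(text=[caption], return_tensors="pt", padding=True)
    text_feat = model.get_text_features(**inputs)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    sims = (db_feats @ text_feat.T).squeeze(-1)
    top = sims.topk(k).indices.tolist()
    return [db_paths[i] for i in top]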
Top part: given a text prompt p, we generate an initial image using a text-to-image (T2I) model. Then, we generate retrieval captions c_j, retrieve images i_j from an external database for each caption, and use them as references for the model to improve the generation.
Bottom part: the retrieval-caption generation block. We use a VLM to decide whether the initial image matches the given prompt. If it does not, we ask the VLM to list the missing concepts and to create a caption that can be used to retrieve appropriate examples for each missing concept.
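A minimal sketch of this loop is shown below. It assumes hypothetical helper functions t2i_generate (a base text-to-image model), vlm_answer (a vision-language model query), and generate_with_references (an image-conditioned generator such as OmniGen), plus the retrieve helper from the earlier retrieval sketch; these names are placeholders, not the paper's API.

def image_rag(prompt, db_paths, db_feats, max_refs=3):
    # Step 1: initial attempt with the base T2I model.
    image = t2i_generate(prompt)  # hypothetical helper

    # Step 2: ask the VLM whether the initial image already matches the prompt.
    answer = vlm_answer(image, f"Does this image match the prompt '{prompt}'? Answer yes or no.")
    if answer.strip().lower().startswith("yes"):
        return image

    # Step 3: ask the VLM for the missing concepts and one retrieval caption per concept.
    missing = vlm_answer(image, f"List the concepts from '{prompt}' that are missing in this image.")
    captions = vlm_answer(image, "Write one short image-search caption per missing concept, "
                                 f"one per line: {missing}").splitlines()

    # Step 4: retrieve a reference image per caption and regenerate with them as context.
    references = []
    for caption in captions[:max_refs]:
        references += retrieve(caption, db_paths, db_feats, k=1)
    return generate_with_references(prompt, references)  # hypothetical image-conditioned generator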
Examples of rare-concept generation, comparing SDXL and OmniGen with and without our method.
The left-most image column shows the reference retrieved by ImageRAG for each prompt.
OmniGen and SDXL both struggle with the uncommon concepts,
sometimes generating completely unrelated images.
However, when using ImageRAG, both models generate the correct concepts.
Comparisons on fine-grained image generation with text-to-image models. We use the ImageNet (Deng et al., 2009), iNaturalist (Van Horn et al., 2018), CUB (Wah et al., 2011), and Aircraft (Maji et al., 2013) datasets. For each dataset, we report the mean (± standard error) CLIP and SigLIP text-to-image similarities, and the DINO similarity between real and generated images. The middle rows feature OmniGen-based models, while the bottom rows feature SDXL-based models. In each part, the best results are bolded.
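For reference, the reported similarity scores could be computed roughly as follows. This sketch assumes the Hugging Face transformers library, an OpenAI CLIP checkpoint, and DINOv2 features as a stand-in for the DINO backbone; the exact models and preprocessing used in the paper may differ, and the SigLIP score is computed analogously with a SigLIP checkpoint.

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
dino = AutoModel.from_pretrained("facebook/dinov2-base")  # stand-in for the DINO backbone
dino_proc = AutoImageProcessor.from_pretrained("facebook/dinov2-base")

@torch.no_grad()
def clip_text_image_similarity(prompt, image_path):
    # Cosine similarity between the prompt embedding and the generated-image embedding.
    image = Image.open(image_path).convert("RGB")
    inputs = clip_proc(text=[prompt], images=image, return_tensors="pt", padding=True)
    out = clip(**inputs)
    t = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    i = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return (t * i).sum().item()

@torch.no_grad()
def dino_image_similarity(real_path, generated_path):
    # Cosine similarity between DINO-style CLS features of a real and a generated image.
    images = [Image.open(p).convert("RGB") for p in (real_path, generated_path)]
    inputs = dino_proc(images=images, return_tensors="pt")
    cls = dino(**inputs).last_hidden_state[:, 0]
    cls = cls / cls.norm(dim=-1, keepdim=True)
    return (cls[0] * cls[1]).sum().item()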
To further assess the quality of our results, we conduct a user study with 46 participants and a total of 767 comparisons. The results above show the percentage of users who preferred our method over each competing method in terms of text alignment, visual quality, and overall preference. As shown, participants favored our method, ImageRAG, over all other methods on all three criteria.
More experiments and ablation studies, as well as further details about each of the results above, can be found in the paper.
ImageRAG can work in parallel with personalization methods and enhance their capabilities. For example, although OmniGen can generate images of a subject based on a reference image of it, it struggles to generate some concepts. Using references retrieved by our method, it can generate the required result, such as my cat teaching a class of dogs, appearing on a mug, or built from LEGO.
If you find our work useful, please cite our paper:
@misc{shalevarkushin2025imageragdynamicimageretrieval,
title={ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation},
author={Rotem Shalev-Arkushin and Rinon Gal and Amit H. Bermano and Ohad Fried},
year={2025},
eprint={2502.09411},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.09411},
}