👁️📑 Multimodal RAG Demo with Nemotron Embed VL and Rerank VL

Input an image or text about food and get recipe images/text back.

This is a scalable workflow that lends itself to many use cases, such as business document retrieval, technical manual lookups and more.

By default, the demo returns the top 3 results from a database of 10,000+ recipes. The limit of 3 is only for the demo; in practice you could return as many as you like.
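As a rough illustration of what "top 3 results" means, here is a minimal sketch of top-k retrieval by cosine similarity. It is not the demo's actual code: the function name `top_k_by_cosine`, the 64-dimensional vectors and the random stand-in embeddings are all assumptions made for the example. In the real demo the vectors would come from the embedding model.

```python
import numpy as np


def top_k_by_cosine(query_vec, doc_matrix, k=3):
    """Return (indices, scores) of the k rows of doc_matrix most similar to query_vec."""
    # Normalise so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    k = min(k, len(scores))
    # Sort by descending score and keep the first k indices.
    top = np.argsort(-scores)[:k]
    return top, scores[top]


# Hypothetical stand-in data: random vectors instead of real recipe embeddings.
rng = np.random.default_rng(0)
recipe_embeddings = rng.normal(size=(10_000, 64))
# A query that is a slightly perturbed copy of recipe 42.
query_embedding = recipe_embeddings[42] + 0.01 * rng.normal(size=64)

indices, scores = top_k_by_cosine(query_embedding, recipe_embeddings, k=3)
print(indices, scores)
```

Changing `k` is all it takes to return more results, which is the sense in which the limit of 3 is a demo setting and not a constraint of the approach.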

Dataset used: https://huggingface.co/datasets/mrdbourke/recipe-synthetic-images-10k
Embedding model used: https://huggingface.co/nvidia/llama-nemotron-embed-vl-1b-v2
- Note: By default we use combined image + text embeddings, since our dataset provides image and text pairs and, according to the launch blog post, this combination performs best.
Rerank model used: https://huggingface.co/nvidia/llama-nemotron-rerank-vl-1b-v2
Generation model used: https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct (note: you could use a larger model such as Nemotron v3, but it would require more compute resources)
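The way an embedding model and a rerank model typically fit together can be sketched as a two-stage search: a cheap score shortlists candidates, then a more careful score reorders the shortlist. The sketch below is an assumption about the general pattern, not the demo's implementation. The names `retrieve_then_rerank`, `word_overlap` and `jaccard`, the `retrieve_k` value and the toy word-based scorers are all hypothetical stand-ins for the real models.

```python
def retrieve_then_rerank(query, candidates, retrieve_score, rerank_score,
                         retrieve_k=20, final_k=3):
    """Shortlist with a cheap score, then reorder the shortlist with a costlier one."""
    # Stage 1: cheap scoring over every candidate (stands in for embedding similarity).
    shortlist = sorted(candidates, key=lambda c: retrieve_score(query, c),
                       reverse=True)[:retrieve_k]
    # Stage 2: costlier scoring over the shortlist only (stands in for the reranker).
    reranked = sorted(shortlist, key=lambda c: rerank_score(query, c), reverse=True)
    return reranked[:final_k]


def word_overlap(query, doc):
    """Toy retrieval score: number of words shared by query and doc."""
    return len(set(query.split()) & set(doc.split()))


def jaccard(query, doc):
    """Toy rerank score: shared words divided by total distinct words."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)


recipes = [
    "garlic butter shrimp pasta",
    "chocolate chip cookies",
    "shrimp fried rice",
    "garlic bread",
    "lemon garlic shrimp",
]
results = retrieve_then_rerank("garlic shrimp", recipes, word_overlap, jaccard,
                               retrieve_k=3, final_k=2)
print(results)  # ['lemon garlic shrimp', 'garlic butter shrimp pasta']
```

The retrieved items would then be passed to the generation model as context. That step is omitted here because it depends on the specific model's interface.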
