👁️📑 Multimodal RAG Demo with Nemotron Embed VL and Rerank VL
Input an image or text about food and get recipe images/text back.
The same workflow scales to many other use cases, such as business document retrieval and technical manual lookups.
By default, it returns the top 3 results from a database of 10,000+ recipes. The demo is limited to 3, but in practice you could return as many as you like.
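To make the "top 3" step concrete, here is a minimal sketch of top-k retrieval by cosine similarity. It is not the demo's actual code: the random vectors stand in for real recipe embeddings, and the names `cosine`, `top_k`, `recipe_embeddings` and `query_embedding` are hypothetical. Changing `k` is all it takes to return more results.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the k best (index, score) pairs, highest score first."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Assumption: random vectors stand in for embeddings produced by a real model.
random.seed(0)
DIM = 16
recipe_embeddings = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(1000)]

# A query that is a slightly noisy copy of recipe 42, so recipe 42 should rank first.
query_embedding = [x + 0.01 * random.gauss(0, 1) for x in recipe_embeddings[42]]

results = top_k(query_embedding, recipe_embeddings, k=3)
for idx, score in results:
    print(f"recipe {idx}: similarity {score:.3f}")
```

A brute-force scan like this is fine for around 10,000 items; larger collections would typically use a vector index instead.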
- Dataset used: https://huggingface.co/datasets/mrdbourke/recipe-synthetic-images-10k
- Embedding model used: https://huggingface.co/nvidia/llama-nemotron-embed-vl-1b-v2
- Note: By default, we use the combined image + text embeddings because our dataset contains image and text pairs and, according to the launch blog post, these perform best.
- Rerank model used: https://huggingface.co/nvidia/llama-nemotron-rerank-vl-1b-v2
- Generation model used: https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct (note: you could use a larger model such as Nemotron v3, but this would require more compute resources)
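The embedding and rerank models listed above suggest a two-stage retrieve-then-rerank pipeline. The sketch below illustrates that general pattern only, under stated assumptions: the two scoring functions are toy word-overlap stand-ins, not calls to the Nemotron models, and `retrieve_then_rerank`, `overlap_score` and `jaccard_score` are hypothetical names. In a full pipeline, the final results would then be passed to the generation model.

```python
def overlap_score(query, doc):
    """Toy first-stage scorer: number of shared words (stand-in for embedding similarity)."""
    return len(set(query.split()) & set(doc.split()))


def jaccard_score(query, doc):
    """Toy second-stage scorer: Jaccard word overlap (stand-in for a rerank model)."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)


def retrieve_then_rerank(query, documents, first_stage, reranker,
                         n_candidates=10, top_k=3):
    """Shortlist with a cheap scorer, then reorder the shortlist with a costlier one."""
    # Stage 1: score every document and keep only the best n_candidates.
    candidates = sorted(documents, key=lambda doc: first_stage(query, doc),
                        reverse=True)[:n_candidates]
    # Stage 2: re-score just the shortlist and return the final top_k.
    reranked = sorted(candidates, key=lambda doc: reranker(query, doc), reverse=True)
    return reranked[:top_k]


recipes = [
    "spicy chicken curry with rice",
    "chocolate chip cookies",
    "chicken noodle soup",
    "vegetable curry",
    "grilled chicken salad with avocado and rice",
]

final = retrieve_then_rerank("chicken curry", recipes,
                             overlap_score, jaccard_score,
                             n_candidates=4, top_k=3)
for rank, recipe in enumerate(final, start=1):
    print(rank, recipe)
```

The design point is that the expensive scorer only sees `n_candidates` items rather than the whole collection, so it can afford to compare the query and each candidate more carefully.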
Query Input
Retrieved Results
Example Queries