Date: 2025-09-26

In the third stage, multimodal context building, the retrieved results are prepared for generation. Some systems use early fusion, converting everything into text before generation, while others use late fusion, where the modalities remain separate and the generator attends to them directly. A framework such as LangChain can orchestrate the retrieval, fusion and generation steps of the pipeline. Graphs can also represent relationships between multimodal entities, helping the system understand connections across the data.
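
The difference between early and late fusion can be illustrated with a minimal Python sketch. The `Evidence` class, both function names and the sample items are hypothetical and exist only for illustration; a real pipeline would hand the resulting context to whatever generator interface it uses.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Evidence:
    modality: str   # e.g. "text", "image", "audio"
    content: Any    # raw payload: a string, a file path, bytes, ...
    as_text: str    # text surrogate such as a caption or transcript


def early_fusion_context(evidence: list[Evidence]) -> str:
    """Early fusion: flatten every item to text and join it into one context string."""
    return "\n".join(f"[{e.modality}] {e.as_text}" for e in evidence)


def late_fusion_context(evidence: list[Evidence]) -> list[dict]:
    """Late fusion: keep each modality as a separate part for the generator."""
    return [{"type": e.modality, "data": e.content} for e in evidence]


# Hypothetical retrieved items (the image path is illustrative only).
retrieved = [
    Evidence("text", "The bridge opened in 1932.", "The bridge opened in 1932."),
    Evidence("image", "bridge.jpg", "Photo of a steel arch bridge over a harbor"),
]

print(early_fusion_context(retrieved))
print(late_fusion_context(retrieved))
```

In the early-fusion case the generator only ever sees the caption, whereas the late-fusion parts still point at the original image.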

There are three approaches to handling nontext modalities:

Text-translation approach: The most direct way to incorporate multimodality into RAG is to convert nontext data types to text at the time of storage and retrieval.[1] Images can be converted to captions, audio to transcriptions, and tabular data to a serialized structure such as CSV or JSON. Because every modality is handled as text, this approach integrates easily with existing text-based RAG systems. Its drawback is an information bottleneck: many of the nuances and details that make nontext data unique can be lost in translation.
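
As a rough sketch of the text-translation step, the following Python serializes tabular data with the standard `csv` and `json` modules and routes images and audio through caller-supplied functions. The `to_text` dispatcher and the `fake_caption` and `fake_transcribe` stubs are hypothetical placeholders; a real system would plug in an actual captioning model and speech-to-text service.

```python
import csv
import io
import json


def table_to_csv(rows: list[dict]) -> str:
    """Serialize a list of row dictionaries as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def table_to_json(rows: list[dict]) -> str:
    """Serialize the same rows as a JSON string."""
    return json.dumps(rows)


def to_text(item: dict, caption_fn, transcribe_fn) -> str:
    """Convert one item to text so a text-only RAG index can store it."""
    kind = item["modality"]
    if kind == "image":
        return caption_fn(item["data"])
    if kind == "audio":
        return transcribe_fn(item["data"])
    if kind == "table":
        return table_to_csv(item["data"])
    return item["data"]  # already text


# Hypothetical stand-ins for a captioning model and a speech-to-text model.
def fake_caption(path: str) -> str:
    return f"caption for {path}"


def fake_transcribe(path: str) -> str:
    return f"transcript of {path}"


rows = [{"quarter": "Q1", "revenue": 120}, {"quarter": "Q2", "revenue": 150}]
print(to_text({"modality": "table", "data": rows}, fake_caption, fake_transcribe))
print(to_text({"modality": "image", "data": "chart.png"}, fake_caption, fake_transcribe))
```

Everything downstream sees only these strings, which is where the information bottleneck comes from: whatever the caption or transcript omits is no longer available to retrieval or generation.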

Text retrieval with multimodal generation: This more advanced approach still performs retrieval over text embeddings (captions, transcripts, metadata), but during response generation it gives a multimodal large language model (LLM) direct access to the original nontext data. For example, the retriever might fetch an image by using its caption, but the image itself is passed into the multimodal LLM alongside the query. This hybrid approach improves expressiveness during generation, especially when the model is fine-tuned on domain-specific datasets, though retrieval quality still depends on the initial text representation of the nontext data.
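
A minimal sketch of this hybrid flow follows. Retrieval runs over captions only, using a toy token-overlap score as a stand-in for text-embedding similarity, and the generator request then carries the original image rather than the caption. The `IndexedImage` class, the file paths and the request layout are assumptions made for illustration and do not follow any particular model API.

```python
from dataclasses import dataclass


@dataclass
class IndexedImage:
    path: str      # location of the original image (hypothetical paths)
    caption: str   # text surrogate that retrieval runs over


def score(query: str, caption: str) -> int:
    """Toy relevance score: number of lowercase tokens shared by query and caption."""
    return len(set(query.lower().split()) & set(caption.lower().split()))


def retrieve(query: str, index: list[IndexedImage]) -> IndexedImage:
    """Return the image whose caption best matches the query."""
    return max(index, key=lambda img: score(query, img.caption))


def build_request(query: str, hit: IndexedImage) -> list[dict]:
    """Assemble generator input: the query text plus the original image, not its caption."""
    return [
        {"type": "text", "text": query},
        {"type": "image", "path": hit.path},
    ]


index = [
    IndexedImage("img/pump.png", "exploded diagram of a water pump"),
    IndexedImage("img/valve.png", "photo of a brass gate valve"),
]

query = "water pump diagram"
hit = retrieve(query, index)
print(build_request(query, hit))
```

Note that an image with a poor caption would never be retrieved here, no matter how relevant its pixels are, which is the dependency on the text representation described above.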

Multimodal retrieval: The most advanced approach uses multimodal embeddings, which map text, images, audio and video into a common vector space. This enables cross-modal retrieval, where a textual query directly retrieves relevant multimodal data. The retrieved multimodal evidence is then passed to the multimodal LLM, which generates a response explicitly grounded in multiple data sources. This approach avoids the bottleneck of text translation and offers the highest degree of contextual grounding. However, it can be costly because of its computational demands and the need for advanced modality-specific encoders. It can also benefit from vision-language models and can be further optimized by using summarization to condense retrieved content before generation.
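
Cross-modal retrieval in a shared vector space can be sketched as a cosine-similarity search. The three-dimensional vectors below are hand-written toy values that stand in for the output of a multimodal encoder, and the file names are hypothetical; real embeddings have hundreds of dimensions and come from a trained model.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# One index holds items of every modality, all embedded in the same space.
# Vectors are toy values, not real encoder output.
index = [
    ("image", "xray_0012.png", [0.9, 0.1, 0.1]),
    ("audio", "consult_07.wav", [0.1, 0.9, 0.2]),
    ("text", "discharge_note.txt", [0.2, 0.2, 0.9]),
]


def search(query_vec: list[float], index: list[tuple], k: int = 2) -> list[tuple]:
    """Rank all items by similarity to the query vector, regardless of modality."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[2]), reverse=True)
    return ranked[:k]


# Assumed embedding of a text query, produced by the same (hypothetical) encoder.
query_vec = [0.8, 0.2, 0.3]
top = search(query_vec, index)
for modality, name, _ in top:
    print(modality, name)
```

Because the query and every item live in the same space, no caption or transcript is needed to match a text query to an image or an audio clip.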



Finally, the response generation stage employs a multimodal LLM to synthesize the output. Because this output is grounded in multimodal evidence, the risk of hallucination is reduced. Depending on design, the final output might be text-only or multimodal, such as a written explanation paired with retrieved images, annotated visuals or tables.

In summary, multimodal RAG progresses from simple text-translation pipelines to hybrid text retrieval with multimodal generation and, ultimately, to true multimodal retrieval with shared embeddings. The chosen approach determines the balance between simplicity, efficiency and expressive power.