Published: 21 August 2024
Contributors: Vanna Winland, Erika Russi
LlamaIndex is an open source data orchestration framework for building large language model (LLM) applications. Available in Python and TypeScript, LlamaIndex provides tools and capabilities that simplify context augmentation for generative AI (gen AI) use cases through a retrieval-augmented generation (RAG) pipeline.
LlamaIndex allows users to curate and organize their private or custom data through data integration and context augmentation.
Context augmentation is the practice of providing data to the LLM’s context window, essentially augmenting the LLM with private or external data.
Widely used open source LLMs are pretrained on large amounts of public data. These large-scale LLMs help solve many real-world problems, but they take a considerable amount of time and resources to train for a specific use case. Additionally, a model’s internal knowledge is only as current as the data it was pretrained on, so context augmentation is necessary for a model to reflect awareness of current events.
Foundation models are gaining popularity because they are flexible, reusable AI models that can be applied to nearly any domain or task. Foundation models such as the IBM Granite™ series are trained on curated data, but whether an application starts from a model trained on a large data set or from a pretrained foundation model, the model will likely need to be connected to external sources of domain-specific data. This process is facilitated by a system framework that connects the LLM to private data, tailoring it to the use case or overall goal of the application. The part of the application stack that connects LLMs to custom data sources is the data framework, such as LlamaIndex.
Data comes from many sources in many formats; it is up to the data framework to ingest, transform and organize the data for LLMs to use. Data is often siloed and unstructured. To obtain and structure it, a data framework such as LlamaIndex must run the data through a process commonly called the ingestion pipeline.
Once the data is ingested and transformed into a format the LLM can use, the next step is to convert the information into a data structure for indexing. The common procedure is converting unstructured data into vector embeddings. This process is called “creating an embedding” in natural language processing (NLP) but is referred to as “indexing” in data terminology.1 Indexing is necessary because it enables the LLM to query and retrieve the ingested data through the vector index. The data can be indexed according to the chosen querying strategy.
Data integration facilitates context augmentation by integrating private data into the context window or “knowledge base” of the LLM. IBM’s Granite 3B and 8B models’ context window length was recently extended to 128,000 tokens.2 A larger context window enables the model to retain more text in its working memory, which enhances its ability to track key details throughout extended conversations, lengthy documents and codebases. This capability allows LLM-chatbots to produce responses that are coherent both in the short term and over a more extended context.
However, even with an extended context window, a fine-tuned model might incur considerable costs in both training and inference. Fine-tuning models with specific or private data requires data transformations and systems that promote efficient data retrieval methods for LLM prompting. RAG methodology is seen as a promising option to facilitate long-context language modeling.3
RAG is one of the best-known and most widely used methods for context augmentation. RAG enables LLMs to leverage a specialized knowledge base, enhancing their ability to provide more precise answers to questions.4 The typical retrieval augmentation process is performed in three steps: the most relevant data is retrieved from the knowledge base, that context is added to the user’s prompt and the LLM generates a response from the augmented prompt.
Data frameworks such as LlamaIndex streamline the data ingestion and retrieval process by providing comprehensive API calls for every step in the RAG pattern. This process is powered by the concept of a query engine that allows users to ask questions over their data. Querying over external data and prompting LLMs with contextual knowledge makes it possible to create domain-specific LLM applications.
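For illustration, here is a minimal sketch of this pattern in Python. It assumes the llama-index package is installed, a default LLM and embedding model are configured (for example, through an OpenAI API key) and that a hypothetical ./data folder holds the source documents:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Ingest: load documents from a local folder (hypothetical path)
documents = SimpleDirectoryReader("./data").load_data()

# Index: build a vector store index over the documents (in memory by default)
index = VectorStoreIndex.from_documents(documents)

# Query: ask a natural language question over the ingested data
query_engine = index.as_query_engine()
response = query_engine.query("What does the report say about revenue?")
print(response)
```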
LlamaIndex uses RAG to add and connect external data to the data pool that LLMs already have access to. Applications including query engines, chatbots and agents use RAG techniques to complete tasks.5
LlamaIndex’s workflow can be broken down into a few steps: loading (ingesting) the data, indexing it, storing the resulting indexes and querying over the data.
The goal of this workflow is to ingest and structure private or domain-specific data and give LLMs access to it. Having access to more relevant data allows LLMs to respond more accurately to prompts, whether for building chatbots or query engines.
Data ingestion, or “loading,” is the first step in connecting external data sources to an LLM. LlamaIndex refers to data ingestion as loading the data for the application to use. Most private or custom data is siloed in formats such as application programming interfaces (APIs), PDFs, images, structured query language (SQL) databases and many more. LlamaIndex can load over 160 different data formats, including structured, semistructured and unstructured datasets.
Data connectors, also known as data “loaders,” fetch and ingest data from its native source. The gathered data is transformed into a collection of data and metadata. These collections are called “Documents” in LlamaIndex. Data connectors, or “Readers” in LlamaIndex, ingest and load various data formats. LlamaIndex features an integrated reader that converts every file in a directory into documents, including Markdown, PDFs, Word documents, PowerPoint presentations, images, audio files and videos.6 To account for other data formats not included in the built-in functionality, data connectors are available through LlamaHub, a registry of open source data loaders. This step in the workflow builds a knowledge base over which indexes can be formed, so the data can be queried and used by LLMs.
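Readers return `Document` objects that pair raw text with metadata. As a rough sketch, a `Document` can also be constructed by hand (the text and metadata values here are purely illustrative):

```python
from llama_index.core import Document

# A Document couples source text with metadata that can inform retrieval
doc = Document(
    text="LlamaIndex is a data framework for LLM applications.",
    metadata={"file_name": "overview.txt", "category": "docs"},
)
print(doc.metadata)
```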
Once the data has been ingested, the data framework needs to transform and organize the data into a structure that can be retrieved by the LLM. Data indexes structure the data into representations that LLMs can use. LlamaIndex offers several different index types that complement the application’s querying strategy, including a vector store index, a summary index and a knowledge graph index.
Once the data has been loaded and indexed, the data can be stored. LlamaIndex supports many vector stores that vary in architecture, complexity and cost. By default, LlamaIndex stores all indexed data only in memory.
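Persisting an index to disk avoids reindexing on every run. A minimal sketch, assuming an `index` built as in the earlier example and a hypothetical ./storage directory:

```python
from llama_index.core import StorageContext, load_index_from_storage

# Write the in-memory index to disk (hypothetical directory)
index.storage_context.persist(persist_dir="./storage")

# Later, rebuild the index from the persisted files instead of reindexing
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
```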
Vector store index
The vector store index is commonly used in LLM applications following the RAG pattern because it excels at handling natural language queries. Retrieval accuracy from natural language queries depends on semantic search, or search based on meaning rather than keyword matching.7 Semantic search is enabled by converting input data into vector embeddings. A vector embedding is a numerical representation of the semantics of the data that the LLM can process. The mathematical relationship between vector embeddings allows LLMs to retrieve data based on the meaning of the query terms for more context-rich responses.
The `VectorStoreIndex` takes the data collections, or `Document` objects, and splits them into `Node`s: atomic units of data that each represent a “chunk” of the source `Document`. Once the data collection is split into chunks, vector embeddings of each chunk are created. The integrated data is now in a format that the LLM can run queries over. These vector indexes can be stored to avoid reindexing. The simplest way to store data is to persist it to disk, but LlamaIndex also integrates with multiple vector databases and embedding models.8
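The `Document`-to-`Node` split can also be controlled explicitly with a node parser. A sketch using LlamaIndex’s `SentenceSplitter`, with illustrative chunk settings:

```python
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

documents = [Document(text="...long source text...")]

# Split each Document into Node chunks with some overlap between chunks
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents)

# Build the vector index directly over the resulting Nodes
index = VectorStoreIndex(nodes)
```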
To search embeddings, the user query is first converted into a vector embedding. Next, a mathematical process is used to rank all embeddings based on their semantic similarity to the query.
The vector store index uses top-k semantic retrieval to return the most similar embeddings along with their corresponding chunks of text. LlamaIndex is designed to facilitate the creation and management of large-scale indexes for efficient information retrieval.
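The retrieval step can also be run on its own through a retriever. A sketch assuming an existing `index`; `similarity_top_k` sets how many of the most similar chunks are returned, and the question is a hypothetical placeholder:

```python
# Return the 3 most semantically similar chunks for a query
retriever = index.as_retriever(similarity_top_k=3)
results = retriever.retrieve("What were the key findings?")

for node_with_score in results:
    # Each result carries a similarity score and the source text
    print(node_with_score.score, node_with_score.text[:80])
```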
The final step in the workflow is the implementation of query engines to handle prompt calls to LLMs. Querying consists of three distinct stages: retrieval, postprocessing and response synthesis. Retrieval is when the most relevant documents are fetched and returned from the vector index. Postprocessing is when the retrieved embedding chunks, or `Node`s, are optionally refitted by reranking, transforming or filtering. Response synthesis is when the most relevant data and the prompt are combined and sent to the LLM to return a response.
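These stages can be configured when the query engine is constructed. A sketch, assuming an `index` built as before, that retrieves five chunks and then filters out any scoring below an illustrative similarity cutoff before synthesis:

```python
from llama_index.core.postprocessor import SimilarityPostprocessor

# Retrieval: fetch the top-5 chunks; postprocessing: drop low-scoring chunks;
# response synthesis: combine the survivors with the prompt and call the LLM
query_engine = index.as_query_engine(
    similarity_top_k=5,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],
)
response = query_engine.query("Summarize the main risks discussed.")
print(response)
```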
Query engines
Query engines allow users to ask questions over their data, taking in a natural language query and returning a context-rich response.9 Query engines are composed of one or many indexes and retrievers. Many query engines can be used at the same time for data collections with multiple index types. LlamaIndex offers several different query engines for structured and semistructured data, for example, a JSON query engine for querying JSON documents.
Data agents are LLM-driven AI agents capable of performing a range of tasks on data, including both reading and writing functions.10 LlamaIndex data agents are LLM-powered knowledge workers that can perform automated search and retrieval over unstructured, semistructured and structured data and can call external service APIs in a structured fashion, processing and storing the responses for later use.
AI agents can interact with their external environment through a set of APIs and tooling. LlamaIndex has support for an OpenAI Function agent (built on top of the OpenAI Function API) and a ReAct agent. The core components of data agents are a “reasoning loop,” the agent’s reasoning paradigm, and “tool abstractions,” the tools themselves.
Agents use a reasoning loop, or paradigm, to solve multistep problems. In LlamaIndex, both OpenAI Function and ReAct agents follow a similar pattern to decide which tools to use, as well as the sequence and the parameters with which to call each tool. This pattern of reasoning is called ReAct, or reasoning and action. The process can be a simple one-step tool selection or a more complex process in which several tools are picked at each step.
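As a minimal sketch of this loop, a plain Python function can be wrapped as a tool and handed to a ReAct agent, which then decides when and how to call it (this assumes the OpenAI integration package is installed and an API key is configured):

```python
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

def multiply(a: float, b: float) -> float:
    """Multiply two numbers and return the product."""
    return a * b

# Wrap the function as a tool the agent can choose to call
multiply_tool = FunctionTool.from_defaults(fn=multiply)

# The ReAct agent reasons about which tool to call, in what order, with what arguments
agent = ReActAgent.from_tools([multiply_tool], llm=OpenAI(), verbose=True)
print(agent.chat("What is 21.7 times 3?"))
```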
Tool abstractions outline how tools are accessed and used by the agent. LlamaIndex offers Tools and ToolSpecs; a ToolSpec is a Python class that represents a full API specification that an agent can interact with. The base tool abstraction defines a generic interface that can take in a series of arguments and return a generic tool output container that can capture any response. LlamaIndex provides tool abstractions that wrap around the existing data query engines, as well as abstractions for functions that can be used by the tool spec class.
A tool spec enables users to define complete services rather than individual tools that handle specific tasks. For example, the Gmail tool spec allows the agent to both read and draft emails.11 Here is an example of how that would roughly be defined:
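A simplified sketch of the pattern follows; the actual GmailToolSpec on LlamaHub has more methods and handles authentication, so the method bodies here are placeholders:

```python
from llama_index.core.tools.tool_spec.base import BaseToolSpec

class GmailToolSpec(BaseToolSpec):
    """A tool spec exposing a service as a set of agent-callable functions."""

    # Only the functions listed here are exposed to the agent as tools
    spec_functions = ["search_messages", "create_draft"]

    def search_messages(self, query: str):
        """Search the inbox for messages matching the query."""
        ...  # placeholder: call the Gmail API here

    def create_draft(self, to: str, subject: str, message: str):
        """Create a draft email with the given recipient, subject and body."""
        ...  # placeholder: call the Gmail API here

# to_tool_list() converts each listed function into an agent-usable tool
tools = GmailToolSpec().to_tool_list()
```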
Each function is converted into a tool, using the `FunctionTool` abstraction.
LlamaIndex uses LlamaHub’s Tool Repository, which consists of 15+ tool specs for agents to interact with. These specs cover services meant to enhance and enrich the agents’ capability to perform different actions.
Utility tools
LlamaIndex offers utility tools that can augment the capabilities of existing tools. These include the OnDemandLoaderTool, which turns any existing data loader into a tool that an agent can invoke on demand to load, index and query data, and the LoadAndSearchToolSpec, which wraps an existing tool so that its output is automatically indexed and made searchable.
LlamaIndex works with foundation models such as the IBM Granite™ series and Llama 2, with model providers such as OpenAI and with other LLM frameworks and tools including LangChain and Ollama. LLMs can be used in a myriad of ways, as stand-alone modules or plugged into other core LlamaIndex modules.12 LLMs can also be used to drive AI agents that act as knowledge workers following autonomous workflows.
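For example, an LLM can be called directly as a stand-alone module, with no index or query engine involved (the model name here is illustrative):

```python
from llama_index.llms.openai import OpenAI

# Use the LLM module directly for text completion, outside any RAG pipeline
llm = OpenAI(model="gpt-4o-mini")
response = llm.complete("Explain retrieval-augmented generation in one sentence.")
print(response)
```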
LlamaIndex has expanded its functionality to include LLM-driven AI agents that act as knowledge workers. LlamaIndex AI agents follow a ReAct (reasoning and action) behavioral pattern, a reasoning paradigm that instructs the agent to plan and reason through a task while iteratively improving its responses. This type of AI agent can be interpreted as a form of chain-of-thought prompting. Agents can use tooling, such as the LlamaHub collection of data connector and agent tools, to integrate with external services and other LLMs.
LlamaIndex offers several use case examples in their docs with links to tutorials.13
Learn more about generative AI
Learn more about LLMs in this explainer article
Learn more about how to clean and store data for consistent accessibility
Learn more about how models process data with vector embedding
Learn more about the open source orchestration framework LangChain
Learn more about prompt engineering
1 Elena Lowery, “Use watsonx.ai with LlamaIndex to build RAG applications,” IBM Community, May 28, 2024, https://community.ibm.com/community/user/watsonx/blogs/elena-lowery/2024/05/28/use-watsonxai-with-llamaindex-to-build-rag-applica.
2 Matt Stallone et al., “Scaling Granite Code Models to 128K Context,” arXiv.org, July 18, 2024, https://arxiv.org/abs/2407.13739 (link resides outside ibm.com).
3 Matt Stallone et al., “Scaling Granite Code Models to 128K Context.”
4 Kim Martineau, “What Is Retrieval-Augmented Generation (RAG)?,” IBM Research, May 1, 2024, https://research.ibm.com/blog/retrieval-augmented-generation-RAG.
5 “High-Level Concepts,” LlamaIndex, https://docs.llamaindex.ai/en/stable/getting_started/concepts/ (link resides outside ibm.com).
6 “Loading Data (Ingestion),” LlamaIndex, https://docs.llamaindex.ai/en/stable/understanding/loading/loading/ (link resides outside ibm.com).
7 Elena Lowery, “Use watsonx.ai with LlamaIndex to build RAG applications.”
8 Elena Lowery, “Use watsonx.ai with LlamaIndex to build RAG applications.”
9 “Query Engine,” LlamaIndex, https://docs.llamaindex.ai/en/latest/module_guides/deploying/query_engine/ (link resides outside ibm.com).
10 Jerry Liu, “Data Agents,” Medium, July 13, 2023, https://medium.com/llamaindex-blog/data-agents-eed797d7972f (link resides outside ibm.com).
11 “Google,” LlamaIndex, https://docs.llamaindex.ai/en/stable/api_reference/tools/google/ (link resides outside ibm.com).
12 “Using LLMs,” LlamaIndex, https://docs.llamaindex.ai/en/latest/module_guides/models/llms/ (link resides outside ibm.com).
13 “Use Cases,” LlamaIndex, https://docs.llamaindex.ai/en/latest/use_cases/ (link resides outside ibm.com).
14 “Question-Answering (RAG),” LlamaIndex, https://docs.llamaindex.ai/en/stable/use_cases/q_and_a/ (link resides outside ibm.com).
15 “Agents,” LlamaIndex, https://docs.llamaindex.ai/en/stable/use_cases/agents/ (link resides outside ibm.com).