Implement prompt caching by using LangChain to build efficient LLM applications

Author

Shalini Harkar

Lead AI Advocate

What is prompt caching?

Prompt caching is a way to store and then reuse the responses generated from executed prompts when working with language models such as IBM Granite® models. If the same input (prompt) is encountered again, rather than making a new API call, the application retrieves the previously stored response from the prompt cache.

Think of prompt caching as a kind of "memory" for your application. The system keeps results from previous queries in order to save computation time by not having to make repeated requests against the same input.
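
To make the idea concrete before bringing in LangChain, the following minimal sketch caches responses in a Python dictionary keyed by the prompt text. The call_model function is a hypothetical stand-in for any LLM API call and is not part of LangChain or watsonx.

# Minimal sketch of the prompt-caching idea (illustrative only)
response_cache = {}

def call_model(prompt: str) -> str:
    # Hypothetical stand-in for a real request to a language model API
    return f"Model response for: {prompt}"

def cached_generate(prompt: str) -> str:
    # Serve a stored response if this exact prompt was seen before
    if prompt in response_cache:
        return response_cache[prompt]
    # Otherwise call the model once and remember the result
    response = call_model(prompt)
    response_cache[prompt] = response
    return response

print(cached_generate("What is prompt caching?"))  # first call reaches the model
print(cached_generate("What is prompt caching?"))  # repeat call is served from the cache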

Why is it important?

Prompt caching is significant because it avoids repeated application programming interface (API) calls by reusing existing responses for identical repeated prompts. This ability results in faster response times, consistent output and lower usage of the API, which is helpful for staying within rate limits. It also helps the application scale and stay resilient during outages. Prompt caching is a critical feature that adds value to any cost-effective, efficient and user-friendly AI application.

Prerequisites

  1. You need an IBM Cloud® account to create a watsonx.ai® project.

  2. You also need Python version 3.12.7.

Steps 

Step 1: Set up your environment

While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.

  1. Log in to watsonx.ai by using your IBM Cloud account.

  2. Create a watsonx.ai project. You can get your project ID from within your project. Click the Manage tab. Then, copy the project ID from the Details section of the General page. You need this ID for this tutorial.

  3. Create a Jupyter Notebook. This step opens a Jupyter Notebook environment where you can copy the code from this tutorial. Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. To view more Granite tutorials, check out the IBM Granite Community.

Step 2: Set up a watsonx.ai Runtime instance and API key

  1. Create a watsonx.ai Runtime service instance (choose the Lite plan, which is a free instance).

  2. Generate an API Key.

  3. Associate the watsonx.ai Runtime service to the project that you created in watsonx.ai.

Step 3: Install the packages

We need libraries to work with the LangChain framework and WatsonxLLM. Let's first install the required packages. This tutorial is built by using Python 3.12.7.

Note: If you are using an older version of pip, run the command pip install --upgrade pip first so that you can install the latest packages, which might not be compatible with older pip versions. If you are already using the latest version or recently upgraded your packages, you can skip this command.

!pip install -q langchain langchain-ibm langchain_experimental langchain-text-splitters langchain_chroma transformers bs4 langchain_huggingface sentence-transformers llama-index

Step 4:  Import required libraries

  • The os module is used to access environment variables, such as project credentials or API keys.
  • WatsonxLLM is a class from langchain_ibm that integrates IBM watsonx foundation models for generating outputs from generative AI models.
  • ChatWatsonx enables chat-based interactions with IBM watsonx through LangChain.
  • SimpleDirectoryReader loads and reads documents from a directory for indexing with LlamaIndex.
  • GenParams contains metadata keys for configuring watsonx text generation parameters.
  • SQLiteCache enables setting up a local .langchain.db SQLite database to avoid redundant API calls and speed up development and testing.

We need a few libraries and modules for this tutorial. Make sure to import the following ones; if any are not installed, a quick pip install resolves the problem.

import os
import getpass
import requests
import random
import json
from typing import Dict, List

from langchain_ibm import WatsonxLLM, ChatWatsonx
from ibm_watsonx_ai.metanames import GenTextParamsMetaNames as GenParams
from llama_index.core import SimpleDirectoryReader

Step 5: Read the text data

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["~/Artificial Intelligence/Generative_AI/files/FIle2.txt"],
).load_data()

document_text = documents[0].text
print(document_text[:200] + "...")

Step 6:  Set up credentials

This code sets up credentials for accessing the IBM Watson Machine Learning (WML) API and helps ensure that the project ID is correctly configured.

  • A credentials dictionary is created with the WML service URL and API key. The API key is securely collected by using getpass.getpass to avoid exposing sensitive information.
  • The code tries to fetch the PROJECT_ID from environment variables by using os.environ. If the PROJECT_ID is not found, the user is prompted to manually enter it via input.

credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",  # Replace with the correct region if needed
    "apikey": getpass.getpass("Please enter your WML API key (hit enter): "),
}

# Set up project_id
try:
    project_id = os.environ["PROJECT_ID"]
except KeyError:
    project_id = input("Please enter your project_id (hit enter): ")

Step 7: Initialize the large language model

This code initializes the IBM watsonx LLM for use in the application:

  1. It creates an instance of WatsonxLLM by using the ibm/granite-3-8b-instruct model (Granite-3.1-8B-Instruct), which is designed for instruction-based generative AI tasks.
  2. The url, apikey and project_id values from the previously set up credentials are passed to authenticate and connect to the watsonx.ai service.
  3. It configures the max_new_tokens parameter to limit the number of tokens that the model generates in each response (2000 tokens in this case).

To learn more about model parameters such as the minimum and maximum token limits, refer to the documentation.

llm = WatsonxLLM(
    model_id="ibm/granite-3-8b-instruct",
    url=credentials["url"],
    apikey=credentials["apikey"],
    project_id=project_id,
    params={
        GenParams.DECODING_METHOD: "greedy",
        GenParams.TEMPERATURE: 0,
        GenParams.MIN_NEW_TOKENS: 5,
        GenParams.MAX_NEW_TOKENS: 2000,
        GenParams.REPETITION_PENALTY: 1.2,
        GenParams.STOP_SEQUENCES: ["\n\n"],
    },
)

Step 8: Set up SQLite cache for faster LLM responses

SQLiteCache is a persistent caching tool offered by LangChain that stores responses from LLM calls in a SQLite database file. It cuts down on compute time by storing the results of costly calls: rather than running the whole generation again, it simply pulls the stored result from disk, making it efficient, reliable and reusable.

Prompt caching workflow

The figure illustrates how, with prompt caching, results load instantly from disk; without it, every query wastes time on redundant computation.

from langchain.cache import SQLiteCache
from langchain.globals import set_llm_cache

set_llm_cache(SQLiteCache(database_path=".langchain.db"))

Run the first call in a separate cell so that the %%time magic is the first line of that cell:

%%time
prompt = "System: You are a helpful assistant.\nUser: Why did Paul Graham start YC?\nAssistant:"
resp = llm.invoke(prompt)
print(resp)

In this case, the CPU worked for only 22 ms, but the elapsed (wall) time was 1.43 seconds.

This gap suggests that most of the time was spent waiting, likely on I/O operations (for example, disk reads and writes, network access or the API call itself).

Now, let's run the model a second time with the same prompt and see the response time.

%%time
llm.invoke(prompt)

Because the prompt is identical, the response is now served from the SQLite cache: the CPU is busy for only a few milliseconds and the wall time drops sharply, because no request is sent to the model service.

Comparing the two runs shows that the slow part of the first call was the blocking external dependency (waiting for a response from the server), which the cache removes for repeated prompts.
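
If persistence across notebook sessions is not needed, LangChain also provides an in-memory cache that can be swapped in with one line. The following is a minimal sketch that reuses the llm object and prompt defined earlier; depending on your LangChain version, InMemoryCache might live in langchain_community.cache instead of langchain.cache.

from langchain.cache import InMemoryCache
from langchain.globals import set_llm_cache

# Keep cached responses in process memory instead of a SQLite file;
# the cache is lost when the Python kernel restarts.
set_llm_cache(InMemoryCache())

resp = llm.invoke(prompt)  # first call goes to the model
resp = llm.invoke(prompt)  # identical repeat call is served from memory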

Conclusion

Prompt caching accelerates API requests to large language models, such as GPT-4o, and reduces their cost. Cached content can include input tokens, output tokens, embeddings, user messages, system prompts or the output of a function; the application serves this stored content instead of making a new network request. This method provides lower pricing, improved response latency and improved key performance indicators (KPIs).

Prompt caching can benefit chatbots, RAG systems, fine-tuning workflows and code assistants. A robust caching strategy that covers cache reads, cache writes, system messages, cache control and a proper time to live (TTL) improves cache hit rates and lowers cache miss rates, as the sketch below illustrates for TTL handling.
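
To illustrate how a TTL can govern cache reads and writes, here is a hypothetical, framework-agnostic sketch. The TTLPromptCache class is not part of LangChain or watsonx; it simply shows expired entries being treated as cache misses, and the usage example reuses the llm and prompt objects from earlier steps.

import time

class TTLPromptCache:
    """Hypothetical prompt cache that expires entries after a time to live (TTL)."""

    def __init__(self, ttl_seconds: float = 3600):
        self.ttl_seconds = ttl_seconds
        self._store = {}  # maps prompt -> (response, timestamp)

    def read(self, prompt: str):
        # Cache read: return the entry only if it has not expired
        entry = self._store.get(prompt)
        if entry is None:
            return None  # cache miss
        response, stored_at = entry
        if time.time() - stored_at > self.ttl_seconds:
            del self._store[prompt]  # expired entries count as misses
            return None
        return response  # cache hit

    def write(self, prompt: str, response: str):
        # Cache write: store the response with the current timestamp
        self._store[prompt] = (response, time.time())

# Example usage
cache = TTLPromptCache(ttl_seconds=600)
cached = cache.read(prompt)
if cached is None:
    cached = llm.invoke(prompt)  # only call the model on a cache miss
    cache.write(prompt, cached)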

Consistently using the same prompt tokens, prompt prefix and system instructions helps maintain consistent prompt performance in multiturn conversations and subsequent requests. Whether you use Python, an SDK, OpenAI or another provider, understanding how prompt caching works will better enable you to implement it across many use cases.
