Prompt caching is a way to store and then reuse the responses generated from executed prompts when working with language models such as IBM Granite® models. If the same input (prompt) is encountered again, the application retrieves the previously stored response from the prompt cache rather than making a new API call.
Think of prompt caching as a kind of "memory" for your application. The system keeps results from previous queries in order to save computation time by not having to make repeated requests against the same input.
Prompt caching is significant because it avoids repeated application programming interface (API) calls by reusing existing responses for identical prompts. The result is faster response times, consistent output and lower API usage, which helps you stay within rate limits. It also helps the workflow scale and adds resilience during outages. Prompt caching is a critical feature for any cost-effective, efficient and user-friendly AI application.
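The idea described above can be sketched in a few lines of plain Python. This is only an illustration of the concept, not the mechanism used later in this tutorial: PromptCache and fake_model are hypothetical names, and fake_model stands in for a real model API call.

```python
from typing import Callable, Dict


class PromptCache:
    """Tiny in-memory prompt cache keyed by the exact prompt string."""

    def __init__(self, model_call: Callable[[str], str]) -> None:
        self._model_call = model_call      # the expensive call we want to avoid repeating
        self._store: Dict[str, str] = {}   # prompt -> previously generated response
        self.api_calls = 0                 # how many times the model was actually invoked

    def generate(self, prompt: str) -> str:
        if prompt in self._store:          # cache hit: reuse the stored response
            return self._store[prompt]
        self.api_calls += 1                # cache miss: call the model once and remember it
        response = self._model_call(prompt)
        self._store[prompt] = response
        return response


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call."""
    return f"response to: {prompt}"


cache = PromptCache(fake_model)
first = cache.generate("What is prompt caching?")   # miss: the model is called
second = cache.generate("What is prompt caching?")  # hit: served from the cache
```

Because the key is the exact prompt string, even a small change in wording counts as a different prompt and triggers a new call.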
You need an IBM Cloud® account to create a watsonx.ai® project.
You also need Python version 3.12.7.
While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.
Log in to watsonx.ai by using your IBM Cloud account.
Create a watsonx.ai project. You can get your project ID from within your project. Click the Manage tab. Then, copy the project ID from the Details section of the General page. You need this ID for this tutorial.
Create a Jupyter Notebook. This step opens a Jupyter Notebook environment where you can copy the code from this tutorial. Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. To view more Granite tutorials, check out the IBM Granite Community.
Create a watsonx.ai Runtime service instance (choose the Lite plan, which is a free instance).
Generate an API Key.
Associate the watsonx.ai Runtime service to the project that you created in watsonx.ai.
We need libraries to work with the LangChain framework and WatsonxLLM. Let's first install the required packages. This tutorial is built using Python 3.12.7.
Note: If you are using an older version of pip, run the command pip install --upgrade pip first, because older versions of pip might not be able to install the latest packages. If you are already using the latest version or recently upgraded your packages, you can skip this command.
os module is used to access environment variables, such as project credentials or API keys.
WatsonxLLM is a class from langchain_ibm that integrates IBM watsonx.ai models with LangChain for generating outputs from generative AI models.
ChatWatsonx enables chat-based interactions by using IBM watsonx through LangChain.
SimpleDirectoryReader is for loading and reading documents from a directory for indexing with LlamaIndex.
GenParams contains metadata keys for configuring watsonx text generation parameters.
SQLiteCache enables setting up a local .cache.db SQLite database to avoid redundant API calls and speed up development and testing.
We need a few libraries and modules for this tutorial. Make sure to import the following ones. If any of them are not installed, a quick pip installation resolves the problem.
This code sets up credentials for accessing the IBM Watson Machine Learning (WML) API and helps ensure that the project ID is correctly configured.
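As a rough sketch of what such a setup step might look like, the helper below reads the credentials from environment variables and fails early if any value is missing. The function name load_watsonx_credentials and the variable names WATSONX_URL, WATSONX_APIKEY and WATSONX_PROJECT_ID are assumptions made for this illustration; adapt them to however your credentials are actually stored.

```python
import os
from typing import Dict, Mapping, Optional


def load_watsonx_credentials(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read credentials from environment variables and fail early if any is missing."""
    env = os.environ if env is None else env
    # Assumed variable names; change them to match your own environment.
    names = {
        "url": "WATSONX_URL",
        "apikey": "WATSONX_APIKEY",
        "project_id": "WATSONX_PROJECT_ID",
    }
    credentials = {key: env.get(var, "").strip() for key, var in names.items()}
    missing = [names[key] for key, value in credentials.items() if not value]
    if missing:
        raise ValueError(f"Missing environment variables: {', '.join(missing)}")
    return credentials


# Example with placeholder values passed in explicitly instead of real secrets.
example = load_watsonx_credentials({
    "WATSONX_URL": "https://example.invalid",
    "WATSONX_APIKEY": "placeholder-key",
    "WATSONX_PROJECT_ID": "placeholder-project",
})
```

Calling load_watsonx_credentials() with no argument reads from os.environ, so a missing project ID surfaces as a clear error before any model call is made.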
This code initializes the IBM Watson LLM for use in the application:
To learn more about model parameters such as the minimum and maximum token limits, refer to the documentation.
SQLiteCache is a persistent caching tool offered by LangChain that stores responses from LLM calls in a SQLite database file. It saves time by storing the results of costly calls, so a repeated prompt becomes a lookup instead of a recomputation. Rather than going through the whole process again, it simply pulls the result from disk, which makes it efficient, reliable and reusable.
The figure illustrates that with prompt caching, results load almost instantly from disk; without it, every query wastes time on redundant computation.
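To make the mechanism concrete, here is a minimal sketch of a persistent prompt cache built directly on Python's sqlite3 module. It is not LangChain's implementation: the class SqlitePromptCache, the table layout and the helper cached_generate are hypothetical and only illustrate the store-then-look-up pattern.

```python
import sqlite3
from typing import Callable, List, Optional


class SqlitePromptCache:
    """Stores responses in a SQLite table keyed by (prompt, model name)."""

    def __init__(self, database_path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(database_path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS prompt_cache ("
            "prompt TEXT NOT NULL, model TEXT NOT NULL, response TEXT NOT NULL, "
            "PRIMARY KEY (prompt, model))"
        )
        self._conn.commit()

    def lookup(self, prompt: str, model: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT response FROM prompt_cache WHERE prompt = ? AND model = ?",
            (prompt, model),
        ).fetchone()
        return row[0] if row else None

    def update(self, prompt: str, model: str, response: str) -> None:
        self._conn.execute(
            "INSERT OR REPLACE INTO prompt_cache (prompt, model, response) "
            "VALUES (?, ?, ?)",
            (prompt, model, response),
        )
        self._conn.commit()


def cached_generate(cache: SqlitePromptCache, model: str, prompt: str,
                    call_model: Callable[[str], str]) -> str:
    """Return the cached response if present; otherwise call the model and store it."""
    cached = cache.lookup(prompt, model)
    if cached is not None:
        return cached
    response = call_model(prompt)
    cache.update(prompt, model, response)
    return response


calls: List[str] = []


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; records each invocation."""
    calls.append(prompt)
    return prompt.upper()


cache = SqlitePromptCache()  # pass a file path instead of ":memory:" to persist on disk
r1 = cached_generate(cache, "demo-model", "hello cache", fake_model)
r2 = cached_generate(cache, "demo-model", "hello cache", fake_model)  # read from SQLite
```

The model name is part of the key so that the same prompt sent to a different model or configuration is not answered with a stale response. In LangChain itself, the cache is typically registered once with set_llm_cache(SQLiteCache(database_path=".cache.db")), after which LLM calls consult it automatically.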
In this case, the CPU worked for only 22 ms, but the actual elapsed time was 1.43 seconds.
This result suggests that most of the time was spent waiting, likely for I/O operations (for example, disk reads and writes, network access or an API call).
Now, let's run the model a second time with the prompt and see the response time.
With SQLiteCache, the CPU was used for just 7.26 ms, but the wall time was 6.15 seconds.
This gap points to blocking external dependencies (like waiting for a response from a server).
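The difference between CPU time and wall time discussed above can be reproduced with the standard library. In this sketch, measure and simulated_api_call are hypothetical helpers, and time.sleep stands in for waiting on a remote server; the figures it prints are unrelated to the measurements quoted in this tutorial.

```python
import time
from typing import Callable, Tuple


def measure(fn: Callable[[], None]) -> Tuple[float, float]:
    """Return (cpu_seconds, wall_seconds) for one call of fn."""
    cpu_start = time.process_time()    # CPU time consumed by this process
    wall_start = time.perf_counter()   # elapsed real-world time
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return cpu, wall


def simulated_api_call() -> None:
    """Stand-in for a blocking network request."""
    time.sleep(0.2)  # waiting consumes wall time but almost no CPU time


cpu_seconds, wall_seconds = measure(simulated_api_call)
print(f"CPU time: {cpu_seconds * 1000:.2f} ms, wall time: {wall_seconds:.2f} s")
```

A large gap between the two numbers means the process was mostly idle, waiting on something outside the CPU.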
Prompt caching speeds up API requests to large language models, such as GPT-4o, and reduces their cost. A prompt cache can hold content such as input tokens, output tokens, embeddings, user messages, a system prompt or the output of a function, so that a repeated request is served from cached content instead of a new network request. This method provides lower pricing, improved response latency and improved key performance indicators (KPIs).
Prompt caching can be beneficial for chatbots, RAG systems, fine-tuning and code assistants. A robust caching strategy that includes functions such as cache read, cache write, system message, cache control and proper time to live (TTL) will improve cache hit rates and lower cache miss rates.
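As a rough illustration of how a TTL and hit-rate tracking fit together, the sketch below expires entries after a fixed number of seconds and counts hits and misses. TTLPromptCache and its methods are hypothetical names, not part of any provider's API, and the clock is injectable only so that the example runs without real waiting.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLPromptCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}  # prompt -> (stored_at, response)
        self.hits = 0
        self.misses = 0

    def read(self, prompt: str) -> Optional[str]:
        entry = self._store.get(prompt)
        if entry is not None:
            stored_at, response = entry
            if self._clock() - stored_at < self._ttl:
                self.hits += 1           # fresh entry: cache hit
                return response
            del self._store[prompt]      # expired entry: drop it and count a miss
        self.misses += 1
        return None

    def write(self, prompt: str, response: str) -> None:
        self._store[prompt] = (self._clock(), response)

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


now = [0.0]                                   # fake clock controlled by the example
cache = TTLPromptCache(ttl_seconds=60, clock=lambda: now[0])
cache.read("hello")                           # miss: nothing stored yet
cache.write("hello", "hi there")
now[0] = 30.0
fresh = cache.read("hello")                   # hit: entry is 30 seconds old
now[0] = 120.0
expired = cache.read("hello")                 # miss: entry is older than the TTL
```

A longer TTL raises the hit rate but increases the chance of serving an outdated response, so the right value depends on how quickly the underlying answers change.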
Consistently using the same prompt tokens, prompt prefix and system instructions helps keep prompt performance consistent in multiturn conversations and subsequent requests. Whether you use Python, an SDK, OpenAI or another provider, understanding how prompt caching works will help you implement it across many use cases.
