Techniques for overcoming context length limitations

Each foundation model has a maximum context length, which is the maximum number of tokens that are allowed in the input plus the generated output for a prompt. If you frequently reach this limit when you use a foundation model to complete a task, try one of these workarounds.

The maximum context length varies from 4,096 to 131,072 tokens, depending on the foundation model. The context length limit can be a problem, especially when you use a foundation model for retrieval-augmented generation, summarization, or conversational tasks.

The following techniques can help you give a foundation model the context it needs to complete a task without exceeding the model's context window limit.

Sending fewer tokens with RAG inputs

Retrieval-augmented generation (RAG) is a technique in which a foundation model is augmented with knowledge from external sources to generate text. In the retrieval step, documents that are relevant to the user's query are identified in an external source. In the generation step, portions of those documents are included in the foundation model prompt to generate a response that is grounded in the retrieved documents.

If you send a prompt with too much grounding information, the input might exceed the foundation model's context window limit. Try some of the following techniques to work around this limitation.
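
As a starting point, it can help to estimate how close a grounded prompt is to the limit before you send it. The following Python sketch shows one way to do that. The 4,096-token limit, the 4-characters-per-token estimate, and the build_grounded_prompt helper are illustrative assumptions; substitute your model's actual limit and tokenizer if you need precise counts.

```python
# Minimal sketch: assemble a RAG prompt and keep it within an assumed context
# window budget. The 4-characters-per-token ratio is a rough heuristic, not an
# exact tokenizer.

MAX_CONTEXT_TOKENS = 4096      # assumed context window for the target model
RESERVED_OUTPUT_TOKENS = 300   # leave room for the generated answer

def estimate_tokens(text: str) -> int:
    """Very rough token estimate (about 4 characters per token for English)."""
    return len(text) // 4

def build_grounded_prompt(question: str, documents: list[str]) -> str:
    """Add retrieved documents to the prompt until the token budget is spent."""
    header = "Answer the question using only the documents provided.\n\n"
    budget = MAX_CONTEXT_TOKENS - RESERVED_OUTPUT_TOKENS
    budget -= estimate_tokens(header) + estimate_tokens(question)

    included = []
    for doc in documents:  # documents are assumed to be sorted by relevance
        cost = estimate_tokens(doc)
        if cost > budget:
            break
        included.append(doc)
        budget -= cost

    context = "\n\n".join(included)
    return f"{header}Documents:\n{context}\n\nQuestion: {question}\nAnswer:"
```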

For more information about RAG, see Retrieval-augmented generation (RAG).

Classify content by relevancy

Use a foundation model that is good at classifying content to first determine whether a document can effectively answer a question before it is included in a prompt.

For example, you can use a prompt such as, "For each question and sample document, check whether the document contains the information that is required to answer the question. If the sample document appears to contain information that is useful for answering the question, respond 'Yes'. Otherwise, respond 'No'."

If you set the Max tokens parameter to 1, the model is more likely to return a Yes or No answer.

Use this classification method to find and remove irrelevant documents or sentences that are retrieved by search. Then, recombine the relevant content to send with the question in the foundation model prompt.
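
The following Python sketch shows one way to apply this filter. The generate function is a stand-in for whatever foundation model client you use; it is assumed to return the model's text completion for a prompt, with generation capped at one token.

```python
# Minimal sketch of a relevance filter that drops documents the model judges
# as not useful for answering the question.

CLASSIFY_TEMPLATE = (
    "For each question and sample document, check whether the document "
    "contains the information that is required to answer the question. "
    "If the sample document appears to contain information that is useful "
    "for answering the question, respond 'Yes'. Otherwise, respond 'No'.\n\n"
    "Question: {question}\n\nDocument: {document}\n\nAnswer:"
)

def is_relevant(question: str, document: str, generate) -> bool:
    """Ask the model for a Yes/No relevance judgment on a single document."""
    prompt = CLASSIFY_TEMPLATE.format(question=question, document=document)
    answer = generate(prompt, max_new_tokens=1)  # corresponds to Max tokens = 1
    return answer.strip().lower().startswith("y")

def filter_documents(question: str, documents: list[str], generate) -> list[str]:
    """Keep only the documents that the model judges as relevant."""
    return [doc for doc in documents if is_relevant(question, doc, generate)]
```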

Summarize long documents

For long documents, use a foundation model to summarize sections of the document. You can then submit a summary of those summaries instead of feeding the entire document to the foundation model.

For example, to generate a useful summary from one long meeting transcript, you might break the transcript into many smaller chunks and summarize each one separately. Or you might search and extract segments with contributions from different speakers. Then, you can chain the summaries together as context for the prompt that you submit to the foundation model.
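
The following Python sketch shows this chunk-and-combine pattern. The summarize function is a stand-in for a call to your foundation model with a summarization prompt, and the 6,000-character chunk size is an illustrative assumption.

```python
# Minimal sketch of summarizing a long transcript in chunks, then summarizing
# the combined chunk summaries.

def split_into_chunks(text: str, chunk_size: int = 6000) -> list[str]:
    """Split the transcript into fixed-size character chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def summarize_long_document(text: str, summarize) -> str:
    """Summarize each chunk, then summarize the combined chunk summaries."""
    chunk_summaries = [
        summarize(f"Summarize the following meeting transcript section:\n\n{chunk}")
        for chunk in split_into_chunks(text)
    ]
    combined = "\n".join(chunk_summaries)
    return summarize(f"Combine these section summaries into one summary:\n\n{combined}")
```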

Generate text embeddings from the content

A text embedding is a numerical representation of a unit of information, such as a word or a sentence, as a vector of real-valued numbers.

To leverage text embeddings in a RAG use case, complete the following steps:

  1. Use a text splitter to chunk the content into meaningful segments.

    For example, you can use a recursive character splitter to chunk the document into segments that meet the syntax and character limit requirements set by the foundation model. See LangChain: Recursively split by character.

  2. Use an embedding model to convert the document chunks into vectors that capture the meaning of the document segment.

  3. Store the vectors that represent your document segments in a vector database.

  4. At run time, search the vector store by using keywords or a search query to retrieve the most relevant document segments to feed to the foundation model.
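
The following Python sketch illustrates steps 2 through 4 with an in-memory store. The embed function is a stand-in for your embedding model and is assumed to return one vector of floats per input text; a production system would use a vector database rather than a Python list.

```python
# Minimal sketch: embed document chunks, store the vectors, and retrieve the
# chunks that are most similar to a query.

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Score how similar two vectors are (1.0 means identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def index_chunks(chunks: list[str], embed) -> list[tuple[str, list[float]]]:
    """Steps 2 and 3: embed each chunk and store (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query: str, store, embed, top_k: int = 3) -> list[str]:
    """Step 4: return the chunks whose vectors are closest to the query vector."""
    query_vector = embed(query)
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:top_k]]
```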

Because the vector search returns only the most relevant document segments, this approach decreases the size of your foundation model prompt without removing the grounding information that helps the model respond with factual answers.

For more information about how to use IBM embedding models for text conversion, see Using vectorized text with retrieval-augmented generation tasks.

Divide and conquer complex tasks

For complex use cases where different types of input are expected and need to be handled in different ways, apply a multiple-prompt approach.

  1. Create several targeted prompts, each one designed for optimal effectiveness with one type of input.

For example, you might classify customer questions into the following types: support issue, sales query, or product detail inquiry.

  2. Use a foundation model to classify the incoming input as one of the predefined input types.

  3. Route the input to the appropriate targeted prompt based on the input type.

    For example, for the product detail inquiry, you might search your website or other marketing content and include a summary in the prompt. For support issues, you might search your troubleshooting knowledge base, and include a workaround procedure in the prompt. And for sales inquiries, you might connect the customer with a seller.
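
The following Python sketch shows the classify-and-route pattern. The classify function is a stand-in for a foundation model call that returns one of the predefined labels, and prompt_builders is a hypothetical mapping from each label to a function that builds the corresponding targeted prompt.

```python
# Minimal sketch: classify the incoming question, then route it to the
# targeted prompt that was designed for that input type.

def route_question(question: str, classify, prompt_builders: dict) -> str:
    """Classify the question, then build the matching targeted prompt."""
    label = classify(
        "Classify the customer question as 'support', 'sales', or 'product'.\n\n"
        f"Question: {question}\n\nLabel:"
    ).strip().lower()

    # Fall back to the product-inquiry prompt if the label is unexpected.
    build_prompt = prompt_builders.get(label, prompt_builders["product"])
    return build_prompt(question)
```

For example, the "support" builder might search a troubleshooting knowledge base and include a workaround procedure in the prompt, while the "product" builder might include a summary of marketing content.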

Summarize dialog context

For conversational tasks, use a foundation model to summarize the context from previous dialog exchanges instead of retaining and resubmitting the chat history with each input. Using summarization reduces the number of tokens that are submitted to the foundation model with subsequent turns in the conversation.
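
The following Python sketch shows one way to maintain a rolling summary. The summarize function is a stand-in for a foundation model call; the running summary replaces the raw chat history, so each new turn sends far fewer tokens.

```python
# Minimal sketch of rolling conversation summarization for a chat application.

def update_summary(summary: str, user_message: str, assistant_reply: str,
                   summarize) -> str:
    """Fold the latest exchange into the running conversation summary."""
    return summarize(
        "Update the conversation summary with the latest exchange.\n\n"
        f"Current summary: {summary or '(empty)'}\n\n"
        f"User: {user_message}\nAssistant: {assistant_reply}\n\n"
        "Updated summary:"
    )

def build_turn_prompt(summary: str, user_message: str) -> str:
    """Send only the summary plus the new user message instead of the full history."""
    return (
        f"Conversation so far (summarized): {summary}\n\n"
        f"User: {user_message}\nAssistant:"
    )
```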

Lower your prompt token count by prompt-tuning the model

By tuning the foundation model, you can free up tokens that might otherwise be used by the complex instructions and examples that you include in your prompt.

When you tune a foundation model, you show the model what to return in output for the types of input that you submit. You can submit zero-shot prompts to a tuned model and still get the output that you expect.

Tuning a model is useful for classification, summarization, and extraction tasks. Tuning is not the best solution when you need the foundation model output to include factual information.

For more information, see Tuning a foundation model.

Parent topic: Prompt tips