
Answer Generation

Explore more of what the RAG Cookbook has to offer to gain a deeper insight into today's RAG solutions

Overview

Answer generation is the component of a RAG solution that creates a response to a user's query using the information retrieved from the targeted enterprise data sources or text corpus. 

Considerations

Model Parameters

The numbers trailing the names of open-source LLMs denote the model's parameter count. For example, Granite 3.0 8B Instruct is a model with 8 billion parameters. Think of parameters as the conductors orchestrating how the model understands the input data and produces outputs. They manifest as weights and biases, influencing how much specific input features contribute to the generated output.

A larger parameter count generally equates to a model with increased complexity and adaptability (not strictly true across different architectures, but generally true within the transformer architecture). A large language model with a higher parameter count can discern more intricate patterns in the data, paving the way for richer and more precise outputs. But, as with many things in life, there is a trade-off: more parameters mean higher computational demands, greater memory needs, and a greater risk of overfitting.
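As a rough illustration of the memory side of that trade-off, the sketch below estimates the space needed just to hold a model's weights, assuming 2 bytes per parameter (fp16/bf16); real deployments need additional headroom for activations, KV cache, and framework overhead.

```python
# Back-of-the-envelope estimate of the memory needed to hold a model's weights,
# assuming 2 bytes per parameter (fp16/bf16). Activations, KV cache, and
# framework overhead are ignored -- this is an illustrative sketch only.
def estimate_weight_memory_gib(num_params_billions: float, bytes_per_param: int = 2) -> float:
    return num_params_billions * 1e9 * bytes_per_param / (1024 ** 3)

# e.g. an 8B-parameter model needs roughly 15 GiB just for its weights in fp16
print(f"{estimate_weight_memory_gib(8):.1f} GiB")
```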

Model Types: instruct vs. code instruct vs. chat

Chat models are designed for conversational contexts, while instruct models are designed for following instructions for natural language processing tasks in specific domains. Code instruct variants are additionally tuned to follow instructions for code-related tasks such as generation and explanation.

Fine-tuning in chat mode helps the LLM generate natural, coherent responses that are relevant and engaging to the user. Fine-tuning in instruct mode helps the LLM follow different types of instructions and generate outputs that are accurate and appropriate to the task.

Model Settings

LLMs provide a handful of settings to 'configure' how responses are generated.

  • "temperature" setting determines how variable the model's responses are. Simply put, the lower the temperature the mode deterministic / consistent the model's responses will be. A very low temperature value, ideally 0,is recommended for RAG solutions.
  • "max_tokens / max_new_tokens" limits the number of tokens (a word is roughly equivalent to 1.5 tokens) the model will use in its response. Solution developers will need to experiment to find a value that balances complete answers with too much information for their use case but 100 is generally good limit for Q&A RAG solutions.
  • the sampling strategy determines how the model selects the next token in a response. RAG solutions should use a greedy sampling strategy, which will guarantee consistent responses to prompts.
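As a minimal sketch, the settings above might be collected into a parameter dictionary like the one below. The exact parameter names and the commented-out `model.generate` call vary by provider and SDK, so treat the names here as illustrative assumptions rather than a specific API.

```python
# Illustrative generation settings for a Q&A RAG solution. Exact parameter
# names differ between providers/SDKs; these mirror common watsonx.ai-style options.
generation_params = {
    "decoding_method": "greedy",   # greedy sampling -> consistent, repeatable answers
    "temperature": 0,              # keep responses deterministic for RAG
    "max_new_tokens": 100,         # cap answer length for concise Q&A responses
}

# Hypothetical call -- substitute your own model client here.
# response = model.generate(prompt=prompt, params=generation_params)
```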

Prompt Engineering

First of all, let's explore some prompt rules that improve the quality of generation.

Rule #1: Start Simple

Do not start by writing a very long prompt and only then test it.

For example, do not start with a long prompt such as:

-    You work in the Finance department of a major electronics company in the S&P 1000. You need to summarize quarterly shareholder meeting transcripts to identify key topics, trends and sentiment.

     Reply in a bulleted numeric list format.

     Ensure each item is a full and complete sentence.
     Do not hallucinate. Only answer with information contained in the transcript.

     Here is the transcript to summarize:
But start here:

-    Summarize key topics contained in the following meeting transcript:

Rule #2: Increments Only

Do not make large changes in model parameters. 

In most cases:

  • Minor changes in temperature and repetition penalty have a noticeable impact.
  • Big changes often hide the gains that small changes would have revealed.

Apply best engineering principles:

  • Change only one parameter at a time, and validate each change separately (see the sketch after this list).
  • Undo changes that don't have the intended effect and return to the prior value.
  • Any change from the defaults should have a good explanation.
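A minimal sketch of the one-parameter-at-a-time approach, assuming a hypothetical `evaluate` scoring routine you would implement against your own test prompts:

```python
# One-at-a-time parameter tuning: change a single setting, score it against the
# same test set, and keep or revert based on the result.
baseline = {"temperature": 0.0, "repetition_penalty": 1.0}

def evaluate(params: dict) -> float:
    return 0.0  # placeholder: run your test prompts with `params` and return a quality score

results = {}
for temperature in (0.0, 0.05, 0.1):              # small increments only
    results[("temperature", temperature)] = evaluate({**baseline, "temperature": temperature})

for penalty in (1.0, 1.05, 1.1):                  # other settings stay at baseline values
    results[("repetition_penalty", penalty)] = evaluate({**baseline, "repetition_penalty": penalty})
```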

Rule #3: Cross Validate

Try to break your prompt. Don't test your prompt once and claim success; run dozens of tests against it.

  • Try to break your prompt before the customer does

Build a test dataset and keep adding examples to it. After every POC release, retest to ensure your prompt continues to work.
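A minimal regression-test sketch, assuming a hypothetical `generate_answer` wrapper around your RAG pipeline and illustrative test cases of your own:

```python
# Minimal regression-test harness: re-run every saved example after each
# prompt or parameter change.
test_cases = [
    {"question": "How do I reset my password?", "must_contain": "reset link"},
    {"question": "What is the refund window?",  "must_contain": "30 days"},
]

def generate_answer(question: str) -> str:
    return ""  # placeholder: call your RAG pipeline here

failures = [
    case for case in test_cases
    if case["must_contain"].lower() not in generate_answer(case["question"]).lower()
]
print(f"{len(test_cases) - len(failures)}/{len(test_cases)} tests passed")
```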

Rule #4: Complex extractions cannot be performed by single prompts

No worries, multiple prompts are processed in parallel using watsonx.ai.

Rule #6: Task Triage (or Prompt Architecture)

Separate each step in the process across multiple prompts, each handled by its own specialized model (see the sketch below).
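A minimal sketch of task triage, assuming a hypothetical `call_model` helper and illustrative model identifiers; each step gets its own focused prompt and, where useful, its own specialized model:

```python
# Task triage: each step in the pipeline has its own prompt and (optionally)
# its own model. `call_model` is a hypothetical wrapper around your LLM client.
def call_model(model_id: str, prompt: str) -> str:
    return ""  # placeholder: invoke the model of your choice here

def summarize_then_classify(transcript: str) -> dict:
    summary = call_model("summarizer-model", f"Summarize key topics in:\n{transcript}")
    sentiment = call_model(
        "classifier-model",
        f"Classify the overall sentiment (positive/neutral/negative):\n{summary}",
    )
    return {"summary": summary, "sentiment": sentiment}
```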

Rule #7: granite.13b.chat.v

Chat models require different prompt design from instruct models.

Rule #8: Have fun and play with the models

Seriously, we learn as much from play as from work.  Try the latest LLMs and challenge them with tasks outside your normal work.

Choose something the LLM could likely accomplish.

  • Write a 4 sentence poem about birds.
  • Chat challenge: OK, now about dogs.
  • Tell me a funny story about a bird named Midori in less than 5 sentences.
  • Chat challenge: Change the bird's name to Charlie and add the color blue, plus make it 7 sentences long.
  • Answer the following question in only 10 words: "why is the sky blue?"
  • Chat challenge: Great, tell me a reason that sounds true but isn't.

Add multi-level interactions to experiment with chaining your prompts.

System prompts & User prompts

Editing the system prompt is just as important as editing the user prompt. The system prompt can make a huge difference in the quality, tone, and other aspects of the answer. System prompts set the context for the interaction, guiding the model's behavior and ensuring consistency, while user prompts drive the specific content of the conversation. It's crucial to be able to edit system prompts to enhance performance.

System Prompts

System prompts provide the underlying instructions that guide the AI's behavior throughout the interaction. They establish the model's role, tone, and ethical boundaries. Using delimiters helps the model identify where the user input starts and ends, and also helps avoid prompt injections. Delimiters can be any distinctive markers, such as triple backticks (```), triple quotes ("""), or angle brackets (< >).

Here is an example of a system prompt for the Granite model:

<|system|>\nYou are Granite Chat, an AI language model developed by IBM. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior. You always respond to greetings (for example, hi, hello, g\'day, morning, afternoon, evening, night, what\'s up, nice to meet you, sup, etc) with "Hello! I am Granite Chat, created by IBM. How can I help you today?". Please do not say anything else and do not start a conversation. {instruction}\n{session_history}<|user|>\n{query}\n<|assistant|>\n

User Prompts

User prompts are the specific instructions or queries provided by the user to achieve the desired response from the model. They should be clear and concise to ensure the model understands the task.
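A minimal sketch of assembling a prompt from a condensed version of the Granite system template above, a user query, and delimiters around the untrusted user input; the `build_prompt` helper and the sample values are illustrative assumptions:

```python
# Assemble the Granite chat template shown above, wrapping the user's input in
# delimiters so the model can tell where it starts and ends.
SYSTEM = (
    "<|system|>\nYou are Granite Chat, an AI language model developed by IBM. "
    "You are a cautious assistant. You carefully follow instructions.\n"
)

def build_prompt(instruction: str, session_history: str, query: str) -> str:
    delimited_query = f'"""\n{query}\n"""'   # delimiters around untrusted user input
    return f"{SYSTEM}{instruction}\n{session_history}<|user|>\n{delimited_query}\n<|assistant|>\n"

prompt = build_prompt(
    instruction="Answer only from the provided context.",
    session_history="",
    query="What is the refund window?",
)
```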

IBM Tools

Prompt Lab

Prompt Lab is an IBM platform that allows you to work with foundation models and build prompts using prompt engineering. Within the Prompt Lab, users can interact with foundation models in the prompt editor using the Chat, Freeform or Structured mode. These multiple options will allow you to craft the best model configurations to support a range of Natural Language Processing (NLP) type tasks including question answering, content generation and summarization, text classification and extraction.

For information on how to get started and everything else about Prompt Lab, please refer to the Official Prompt Lab Site.

InstructLab

An emerging strategy to improve the accuracy of a RAG solution is to use a generation model that is fine-tuned on the data corpus. This can improve end-to-end accuracy for several reasons, the most significant being that the generation stage can improve and even correct the retrieval results. It can also provide more enterprise-specific, relevant answers because it is tuned on your enterprise business content.

For information on how to get started and everything else about InstructLab, please refer to the Official InstructLab Repository.

Granite

Granite 3.0 includes a range of models, such as the Granite 3.0 8B Instruct, 2B Instruct, 8B Base, and 2B Base, which have been trained on over 12 trillion tokens across 12 natural languages and 116 programming languages. These models match or outperform similarly sized models from leading providers on both academic and enterprise benchmarks, showcasing strong performance in tasks such as language understanding, code generation, and document summarization.

Granite 3.0 Differentiators

Transparency and Safety

IBM's commitment to transparency and safety is evident in the detailed disclosure of the training datasets, filtering, and curation processes in the Granite 3.0 technical report. The models are released under the permissive Apache 2.0 license, ensuring flexibility and autonomy for enterprise clients and the broader AI community. Additionally, the Granite Guardian 3.0 models provide comprehensive risk and harm detection capabilities, outperforming other safety models in the market.

Efficiency and Cost-Effectiveness

Granite 3.0 models are engineered to be cost-efficient, allowing enterprises to achieve frontier model performance at a fraction of the cost. The use of InstructLab, a collaborative open-source approach, enables fine-tuning smaller models to specific tasks, reducing costs by 3x-23x compared to larger models. The Mixture of Experts (MoE) Architecture models, such as the Granite 3.0 3B-A800M and 1B-A400M, offer high inference efficiency with minimal performance trade-offs, making them ideal for low-latency applications and CPU-based deployments.

Multimodal Capabilities and Future Updates

By the end of 2024, the Granite 3.0 models are expected to support an extended 128K context window and multimodal document understanding capabilities, including image-in, text-out tasks. This expansion will further enhance their utility in various enterprise use cases.

Ecosystem Integration

The Granite 3.0 models are available through multiple platforms, including IBM's watsonx.ai, Hugging Face, Google Cloud's Vertex AI, NVIDIA NIM microservices, Ollama, and Replicate, providing developers with a wide range of deployment options and ensuring seamless integration with existing workflows. In summary, IBM's Granite 3.0 models offer a powerful, transparent, and cost-effective solution for enterprise AI, combining state-of-the-art performance with robust safety features and extensive ecosystem support.

For information on how to get started and everything else about Granite, please refer to the Official Granite Site.

Tips & Recommendations

Better Prompting

The content and structure of the prompts submitted to LLMs can greatly affect the quality and fidelity of the responses they generate.

The Six Prompt Types

Prompts can be divided into six broad types:

Keyword Only

Keyword only queries are self-explanatory; they are prompts made up of keywords related to the topic in question. For example, the prompt:

What is the optimal baking time for a chocolate cake at high altitude

can be reduced to the keywords:

baking time chocolate cake high altitude

Keyword-only prompts typically generate broader, more topic-focused responses with strong relevance to the retrieved augmentation text. This can be a desirable effect for a solution requiring broad, informative responses, such as an operator assistant chatbot, but may generate too much information for solutions requiring succinct, conversational responses.
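A minimal sketch of reducing a natural-language prompt to keywords by dropping stopwords; the stopword list is an illustrative assumption, and a production solution might use a proper NLP library instead:

```python
# Reduce a natural-language prompt to topic keywords by dropping common stopwords.
STOPWORDS = {"what", "is", "the", "for", "a", "an", "at", "of", "to", "how", "do", "i"}

def to_keywords(prompt: str) -> str:
    words = prompt.lower().rstrip("?").split()
    return " ".join(word for word in words if word not in STOPWORDS)

print(to_keywords("What is the optimal baking time for a chocolate cake at high altitude?"))
# -> "optimal baking time chocolate cake high altitude"
```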

Comparative

Comparative prompts ask the LLM to draw comparisons between one or more topics or concepts in the prompt. For example:

What is the difference between a rule and a role assignment?

Comparative prompts can be useful when the generation model and the supporting corpus have information on all of the concepts, as well as obvious dimensions or supporting material on how the concepts can be usefully compared. Without these a comparative prompt is likely to generate a nonsensical or ineffective response.

Aggregated

Aggregated prompts ask the model to aggregate two or more concepts or queries in a single prompt. For example:

How do I create a permission and a permission group?

Aggregated prompts are generally more difficult for LLMs because of inherent ambiguity in the question.   

  • Is it about multiple concepts to be executed in parallel to minimize processing time?
  • Is it about complementary concepts that reinforce each other?
  • Is the intent to explain how to perform actions in a sequential series of steps?

To resolve this ambiguity, it is generally recommended to split aggregated prompts into multiple single topic prompts, or to provide additional context that explicitly describes the desired response.
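A minimal sketch of that splitting approach, assuming a hypothetical `answer` function wrapping a single retrieve-and-generate call; in practice the decomposition itself can also be delegated to an LLM:

```python
# Split an aggregated prompt into single-topic prompts and answer each separately.
def answer(question: str) -> str:
    return ""  # placeholder: retrieve + generate for this single question

# Aggregated prompt: "How do I create a permission and a permission group?"
sub_questions = [
    "How do I create a permission?",
    "How do I create a permission group?",
]
combined_response = "\n\n".join(f"{q}\n{answer(q)}" for q in sub_questions)
```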

Ambiguous

An ambiguous prompt is one that can be interpreted in multiple ways. For example:

What is a role?

This prompt could generate responses about roles in the context of acting and a subsequent discussion of famous actors and their roles, roles in the context of an IT security solution, or roles in the context of an organization. The uncertainty they introduce makes ambiguous prompts a poor choice for almost any solution, and they should be revised to include additional detail and context to focus the LLM on the desired topic.

Deviant

In the context of a RAG solution a deviant prompt is one without an answer in the corpus of supporting documents. This can lead to hallucinated / improvised responses based on the knowledge embedded in the LLM, or irrelevant responses based on low relevance search results from the corpus.

While it can be impossible to guard against deviant prompts in all use cases (e.g. a conversational chatbot), solution developers can minimize them by constraining user inputs through choice lists, e.g. "I can help you with…" followed by a series of pre-defined topic buttons, or by 'nudging' users towards safe responses through descriptive or directive language in user directions and model responses.
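A minimal sketch of constraining user inputs to a set of supported topics before anything reaches the model; the topic names and messages are purely illustrative:

```python
# Constrain user input to supported topics before the prompt reaches the model.
SUPPORTED_TOPICS = {
    "billing": "I can help you with billing questions.",
    "passwords": "I can help you reset or change your password.",
    "orders": "I can help you track or modify an order.",
}

def route(user_choice: str) -> str:
    if user_choice in SUPPORTED_TOPICS:
        return SUPPORTED_TOPICS[user_choice]
    # nudge the user back towards safe, supported topics
    return "I can help you with: " + ", ".join(SUPPORTED_TOPICS)
```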

Indirect

Indirect prompts are prompts whose keywords are not directly in the corpus but have synonyms that are. Indirect prompts are challenging for solution developers because their reliability depends on how many synonyms exist for the prompt keywords and whether those synonyms lead to an unambiguous query.

Developers are advised to implement a query-filtering mechanism that blacklists 'troublesome' keywords, adding disambiguating detail or replacing them with unambiguous synonyms or phrases, as in the sketch below.
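A minimal sketch of such a query-rewriting filter; the synonym mapping is an illustrative assumption and would be built from the vocabulary actually used in your corpus:

```python
# Rewrite queries by replacing 'troublesome' keywords with the unambiguous
# terms actually used in the corpus.
SYNONYM_MAP = {
    "login id": "user identifier",
    "entitlement": "permission",
}

def rewrite_query(query: str) -> str:
    rewritten = query.lower()
    for keyword, replacement in SYNONYM_MAP.items():
        rewritten = rewritten.replace(keyword, replacement)
    return rewritten
```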

Good Prompting Practices

Though the creation of 'good' prompts is as much an art as it is a science there are a number of accepted practices that lead to better prompts and thus better results.

A good prompt is:

  • Relevant to the target domain, containing enough detail and context to make the content and tone of the desired output clear and specific.

  • Tailored to the target audience. A response intended for a corporate financial analyst will have a very different tone, structure, and content compared to a response for a retail banking client. A good prompt will have instructions ("You are a mid-level financial analyst. Generate a report…") or clear indication of the audience ("… written for a retail banking customer who has no knowledge of financial terminology") for the response.

  • Designed for a specific use case. A good prompt should be designed with a specific use case in mind and include details about where and how the generated response will be used ("… the response will be published in a travel industry magazine for tour operators"), which will in turn influence the tone, language, and style of the generated response.
