Date: 2024-11-15

Model Parameters

The number trailing the name of an open-source LLM denotes the model’s parameter count. For example, Granite 3.0 8B Instruct is a model with 8 billion (8B) parameters. Think of parameters as the conductors orchestrating how the model interprets the input data and produces outputs. They take the form of weights and biases, which determine how strongly specific input features influence the generated output.

A larger parameter count generally means a more complex and adaptable model. (This does not hold strictly across different architectures, but it is generally true within the transformer architecture.) A large language model with a higher parameter count can discern more intricate patterns in the data, paving the way for richer and more precise outputs. But, as with many things in life, there’s a trade-off: more parameters mean higher computational demands, greater memory needs, and a greater risk of overfitting.
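To make the memory side of that trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes memory is dominated by the stored weights and ignores activations, caches, and runtime overhead, so real requirements will be higher. The function name and the precision table are illustrative, not part of any library.

```python
# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(num_params: float, precision: str = "fp16") -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# An 8B-parameter model at each precision.
for precision in BYTES_PER_PARAM:
    print(precision, weight_memory_gb(8e9, precision), "GB")
```

Under these assumptions, an 8B model needs roughly 16 GB for its weights at 16-bit precision, and twice that at 32-bit.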

Model Types: instruct vs. code instruct vs. chat

Chat mode is designed for conversational contexts, while instruct mode is designed for natural language processing tasks in specific domains.

Fine-tuning in chat mode helps the LLM generate natural, coherent responses that are relevant and engaging to the user. Fine-tuning in instruct mode helps the LLM follow different types of instructions and generate outputs that are accurate and appropriate to the task.

Model Settings

LLMs provide a handful of settings to configure how responses are generated.

"temperature" determines how variable the model's responses are. Simply put, the lower the temperature, the more deterministic and consistent the model's responses will be. A very low temperature value, ideally 0, is recommended for RAG solutions.

"max_tokens / max_new_tokens" limits the number of tokens (a word is roughly equivalent to 1.5 tokens) the model will use in its response. Solution developers will need to experiment to find a value that allows complete answers without excess information for their use case, but 100 is generally a good limit for Q&A RAG solutions.

The sampling strategy determines how the model selects the next token in a response. RAG solutions should use a greedy sampling strategy, which always picks the most likely next token and so gives consistent responses to identical prompts.
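The interaction between temperature and the sampling strategy can be illustrated with a small, self-contained sketch. This is a simplified model of next-token selection, not the implementation of any particular LLM runtime; the token names and scores are made up for illustration. Note that a temperature of exactly 0 cannot be plugged into the formula (it would divide by zero), which is why it is usually treated as equivalent to greedy selection.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Greedy decoding: always take the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_pick(logits, temperature, rng):
    """Sampling: draw a token at random according to the scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical scores for three candidate next tokens.
tokens = ["Paris", "London", "Rome"]
logits = [2.0, 1.0, 0.5]

print(tokens[greedy_pick(logits)])            # same token on every call
print(softmax_with_temperature(logits, 1.0))  # probability spread across tokens
print(softmax_with_temperature(logits, 0.1))  # almost all mass on the top token

rng = random.Random(0)
print([tokens[sample_pick(logits, 1.0, rng)] for _ in range(5)])  # can vary by draw
```

With greedy selection the same scores always yield the same token, which is the property a RAG solution wants; with sampling at higher temperatures, lower-ranked tokens are chosen some of the time.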

Prompt Engineering

First, let's look at some rules for writing prompts that improve the quality of generation.

Rule #1: Start Simple

Do not start by writing a very long prompt and only testing it afterwards.

For example, do not start with a long prompt such as:

- You work in the Finance department of a major electronics company in the S&P 1000. You need to summarize quarterly shareholder meeting transcripts to identify key topics, trends and sentiment.

Reply in a bulleted numeric list format.

Ensure each item is a full and complete sentence.

Do not hallucinate. Only answer with information contained in the transcript.



Here is the transcript to summarize:

But start here:

- Summarize key topics contained in the following meeting transcript:
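One way to apply this rule is to keep the simple prompt as a base and add one instruction at a time, testing each version before adding the next. The sketch below is a minimal illustration: the names `BASE`, `REFINEMENTS`, and `build_prompt` are hypothetical, the prompt layout is an assumption, and the placeholder transcript stands in for real input.

```python
BASE = "Summarize key topics contained in the following meeting transcript."

# Candidate refinements, borrowed from the longer prompt, in the order to try them.
REFINEMENTS = [
    "Reply in a numbered list format.",
    "Ensure each item is a full and complete sentence.",
    "Only answer with information contained in the transcript.",
]

def build_prompt(num_refinements, transcript):
    """Return the base prompt plus the first `num_refinements` extra instructions."""
    instructions = [BASE] + REFINEMENTS[:num_refinements]
    return "\n".join(instructions) + "\n\nTranscript:\n" + transcript

transcript = "(placeholder transcript text)"
for n in range(len(REFINEMENTS) + 1):
    print(f"--- version {n} ---")
    print(build_prompt(n, transcript))
    print()
```

Version 0 is the simple starting point; each later version adds exactly one instruction, so any change in output quality can be traced to that instruction.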

Rule #2: Increments Only

Do not make large changes to the model's generation settings.

In most cases:

Minor changes in temperature and repetition penalty have a noticeable impact.

Big changes often hide improvements that small changes would have achieved.

Apply best engineering principles:

Change only one parameter at a time. Validate each change separately.

Undo changes that don’t have the intended effect. Return to the prior value.

Any change from the default should have a good explanation.
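The principles above can be sketched as a small tuning loop that changes one setting at a time and reverts anything that does not help. This is only an illustration: `toy_score` is a stand-in with arbitrary targets chosen so the example runs, and in practice the score would come from evaluating real model outputs. The default values shown are assumptions, not recommendations.

```python
def tune_one_at_a_time(defaults, candidates, score):
    """Try each candidate value alone; keep it only if the score improves."""
    current = dict(defaults)
    best = score(current)
    log = []
    for name, value in candidates:
        trial = {**current, name: value}  # change exactly one setting
        result = score(trial)
        if result > best:
            current, best = trial, result
            log.append((name, value, result, "kept"))
        else:
            # No improvement: discard the trial and stay on the prior value.
            log.append((name, value, result, "reverted"))
    return current, log

# Stand-in scorer with arbitrary targets (hypothetical, for illustration only).
def toy_score(params):
    return -abs(params["temperature"] - 0.2) - abs(params["repetition_penalty"] - 1.1)

defaults = {"temperature": 0.0, "repetition_penalty": 1.0}
candidates = [
    ("temperature", 0.1),
    ("temperature", 0.2),
    ("temperature", 0.3),
    ("repetition_penalty", 1.05),
    ("repetition_penalty", 1.1),
]

final, log = tune_one_at_a_time(defaults, candidates, toy_score)
for entry in log:
    print(entry)
print("final settings:", final)
```

The log doubles as the explanation for every departure from the defaults: each kept change has a recorded score that justified it.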

Rule #3: Cross Validate

Try to break your prompt. Don’t test your prompt once and claim success. Run dozens of tests against your prompt.

Try to break your prompt before the customer does.

Build a test dataset and keep adding your examples. After every POC release, retest to ensure your prompt continues to work.
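A test dataset like this can be as simple as a list of inputs paired with phrases the output must contain. The sketch below shows one possible shape for such a harness; `fake_generate` is a stand-in for a real model call (it just echoes the transcript back), and the case format and substring check are assumptions rather than an established convention.

```python
PROMPT = "Summarize key topics contained in the following meeting transcript:\n{transcript}"

def run_prompt_tests(generate, prompt_template, test_cases):
    """Run every test case and return those whose output misses an expected phrase."""
    failures = []
    for case in test_cases:
        output = generate(prompt_template.format(transcript=case["transcript"]))
        missing = [p for p in case["must_contain"] if p.lower() not in output.lower()]
        if missing:
            failures.append({"name": case["name"], "missing": missing, "output": output})
    return failures

# Stand-in for a real model call: returns everything after the prompt's first line.
def fake_generate(prompt):
    return prompt.split("\n", 1)[1]

# Keep adding cases here, especially inputs that broke an earlier prompt version.
test_cases = [
    {"name": "revenue", "transcript": "Revenue grew 4% this quarter.", "must_contain": ["revenue"]},
    {"name": "empty", "transcript": "", "must_contain": ["no topics"]},
]

failures = run_prompt_tests(fake_generate, PROMPT, test_cases)
print(f"{len(test_cases) - len(failures)} passed, {len(failures)} failed")
for failure in failures:
    print("FAILED:", failure["name"], "missing", failure["missing"])
```

Rerunning the same harness after every release turns the accumulated examples into a regression suite for the prompt.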

Rule #4: Complex extractions cannot be performed by single prompts

Instead, split a complex extraction into several simpler prompts. Multiple prompts can be processed in parallel using watsonx.ai.
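As a rough illustration of splitting one complex extraction into several focused prompts and running them concurrently, the sketch below uses Python's standard thread pool. It does not use the watsonx.ai SDK: `call_model` is a hypothetical stand-in that would be replaced by a real model call, and the sub-prompts are invented examples.

```python
from concurrent.futures import ThreadPoolExecutor

def call_model(prompt):
    # Hypothetical stand-in for a real model call; returns a canned string.
    return f"[answer to: {prompt.splitlines()[0]}]"

def run_in_parallel(prompts, max_workers=4):
    """Send each named prompt concurrently; results come back keyed by name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        answers = pool.map(call_model, prompts.values())  # preserves input order
        return dict(zip(prompts.keys(), answers))

transcript = "(placeholder transcript text)"

# One focused prompt per item to extract, instead of a single complex prompt.
sub_prompts = {
    "topics": "List the key topics in the transcript.\n" + transcript,
    "trends": "List the trends mentioned in the transcript.\n" + transcript,
    "sentiment": "Describe the overall sentiment of the transcript.\n" + transcript,
}

results = run_in_parallel(sub_prompts)
for name, answer in results.items():
    print(name, "->", answer)
```

Each sub-prompt stays simple enough to test on its own, and the results can be combined afterwards into the full extraction.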