What is LLM Temperature?

16 December 2024

 

Authors

Jacob Murel Ph.D.

Senior Technical Content Creator

Joshua Noble

Data Scientist

In artificial intelligence (AI) and machine learning, temperature is a parameter for adjusting the output of large language models (LLMs). Temperature controls the randomness of text that is generated by LLMs during inference. LLMs generate text by predicting the next word (or rather, the next token) according to a probability distribution. The LLM assigns each candidate token a logit (a raw numerical score), and the full set of logits is normalized by the softmax function into a probability distribution: each token receives a probability between zero and one, and the probabilities of all tokens sum to one.

The temperature parameter modifies this distribution. A lower temperature makes the tokens with the highest probability even more likely to be selected; a higher temperature increases the model's likelihood of selecting less probable tokens, because it introduces more variability into token selection. Different temperature settings therefore introduce different levels of randomness when a generative AI model outputs text.
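As a minimal sketch of how temperature reshapes the softmax distribution, the following Python snippet divides each logit by the temperature before applying softmax. The logit values are hypothetical and chosen only for illustration:

import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits into a probability distribution, scaled by temperature (> 0)."""
    scaled = np.array(logits) / temperature
    # Subtract the max before exponentiating for numerical stability
    exps = np.exp(scaled - np.max(scaled))
    return exps / exps.sum()

# Hypothetical logits for four candidate next tokens
logits = [4.0, 2.5, 1.0, 0.5]

for t in (0.5, 1.0, 2.0):
    print(t, softmax_with_temperature(logits, t).round(3))

Lower temperatures concentrate probability on the highest-logit token, while higher temperatures flatten the distribution so that less likely tokens have a better chance of being sampled.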

Temperature is a crucial feature for controlling randomness in model performance. It allows users to adjust the LLM output to better suit different real-world applications of text generation. More specifically, this LLM setting allows users to balance coherence and creativity when generating output for a specific use case. For instance, a low temperature might be preferable for tasks requiring precision and factual accuracy, such as technical documentation or conversational replies with chatbots. The lower temperature value helps the LLM to produce more coherent and consistent text and avoid irrelevant responses. By contrast, a high temperature is preferable for creative outputs or creative tasks such as creative writing or concept brainstorming. The temperature setting effectively allows users to fine-tune LLMs and adjust a model's output to their own desired outcome.

Temperature is often conflated with ‘creativity,’ but this isn’t always the case. It’s more helpful to think of temperature as controlling how broadly the model draws on text from its training data. Max Peeperkorn et al.1 conducted an empirical analysis of LLM output at different temperature values and wrote:

“We find that temperature is weakly correlated with novelty, and unsurprisingly, moderately correlated with incoherence, but there is no relationship with either cohesion or typicality. However, the influence of temperature on creativity is far more nuanced and weak than suggested by the "creativity parameter" claim; overall results suggest that the LLM generates slightly more novel outputs as temperatures get higher.”
 

A high temperature value can make model outputs seem more creative but it's more accurate to think of them as being less determined by the training data.

Configuring temperature

Many APIs for accessing models, including IBM® Granite™ Instruct or OpenAI's ChatGPT, expose temperature alongside a variety of other LLM parameters. The three most common are listed here, followed by a short code sketch:

do_sample: This parameter controls whether the model samples during text generation. Sampling is a method to vary text output. When set to "True," the model randomly samples from the predicted token probabilities rather than always selecting the single most probable token (greedy decoding). This parameter must be set to true for temperature adjustments to have any effect on the pretrained LLM.

top_k: This parameter restricts the model's possible choices when random sampling to the k most likely tokens. While do_sample enables random sampling over predicted tokens beyond the most likely one, this parameter limits the number of candidate tokens from which the model selects. While random sampling helps produce more varied and diverse outputs, this parameter helps maintain the quality of generated text by excluding the more unlikely tokens from being sampled.

top_p: This parameter is sometimes also called nucleus sampling. It is another method of limiting the choices of random sampling to avoid inconsistent and nonsensical output. This parameter restricts sampling to the smallest set of tokens whose cumulative probability exceeds a specified value: when generating output, the model considers only a group of tokens whose total probability is at least, for instance, 95%. While random sampling enables the model to have a more dynamic output, the top p parameter ensures that the output maintains some coherence and consistency.
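A minimal sketch of how these sampling parameters might be set together, here using the Hugging Face transformers library. The checkpoint name and parameter values are assumptions chosen for illustration:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name; any causal language model works the same way
model_name = "ibm-granite/granite-3.1-2b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Write a story about a data scientist who loves Python.",
                   return_tensors="pt")

output_ids = model.generate(
    **inputs,
    do_sample=True,    # enable random sampling (required for temperature to apply)
    temperature=0.75,  # rescale the softmax distribution
    top_k=50,          # keep only the 50 most likely tokens
    top_p=0.95,        # keep the smallest token set whose cumulative probability exceeds 0.95
    max_new_tokens=200,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Without do_sample=True, the model falls back to greedy decoding and the temperature, top_k and top_p settings are ignored.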

Controlling output

Oftentimes models allow users to control the output from the model more explicitly as well. Some commonly used parameters for output control are:

Maximum length: The maximum length is the total number of tokens the model is allowed to generate. This setting is useful because it allows users to manage the length of the model's response and can prevent overly long or irrelevant responses.

Stop sequences: These sequences tell the model when to stop generating output and help to control content length and structure. Prompting an LLM to write an email using "Best regards," or "Sincerely," as a stop sequence tells the model to stop before the closing salutation. This approach can help keep the email short and to the point. Stop sequences are useful for output that you expect to come out in a structured format such as an email, a numbered list or dialog.

Frequency penalty: A frequency penalty is a setting that discourages repetition in the generated text by penalizing tokens proportionally to how frequently they appear. The more often a token is used in the text, the less likely the LLM is to use it again.

Presence penalty: The presence penalty is similar to the frequency penalty, but it penalizes tokens based on whether they have appeared at all in the text, rather than proportionally to how often they appear.
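A minimal sketch of how these output controls appear in a typical chat completions request, using parameter names from the OpenAI Python client. The model name and values are illustrative assumptions:

from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user",
               "content": "Write a short email announcing a team meeting."}],
    max_tokens=150,                # maximum length of the generated reply
    stop=["Best regards,"],        # stop sequence: end before the closing salutation
    frequency_penalty=0.5,         # penalize tokens proportionally to how often they appear
    presence_penalty=0.3,          # penalize tokens that have appeared at all
    temperature=0.7,
)
print(response.choices[0].message.content)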

Comparing outputs with IBM Granite

To see how temperature affects the output of an LLM, look at the outputs created by the IBM Granite 3.1 2 billion parameter model. The prompt provided here was: “Write a story about a data scientist who loves Python.”

With a temperature of 0.1 Granite generates:

Once upon a time, in the bustling city of San Francisco, there lived a brilliant data scientist named Alex. Alex was known for their exceptional skills in Python programming and data analysis. They spent their days and nights immersed in the world of data, using Python to extract insights and make predictions.

One day, Alex was tasked with analyzing customer churn for a popular streaming service. They used Python's scikit-learn library to build a machine learning model that could predict which customers were likely to cancel their subscriptions.

With a temperature of 0.75 Granite generates:

Once upon a time, in the bustling city of San Francisco, there lived a brilliant data scientist named Alex. Alex was known for their deep love for Python, a programming language that they believed was the perfect tool for data analysis and machine learning.

Every day, Alex would wake up early, brew a pot of coffee, and sit down at their desk. They would spend hours writing code in Python, exploring datasets, and building predictive models. Their love for Python was not just about the language itself, but also about the community that had grown around it.

With a temperature of 1.25 Granite generates:

Once upon a time, in a bustling city filled with towering skyscrapers and neon lights, there was a data scientist who loved Python more than anything in the world. The data scientist has a feline helper named Suki. Suki is a curious and playful cat, always eager to explore every nook and cranny of the house.

One day, the data scientist is working on a complex coding problem. Suki keeps meowing loudly as if she is trying to communicate a message to the data scientist. The data scientist notices that the loud meowing usually happens when they are close to a solution to the complex coding problem.

Notice that even though each story starts with a similar beginning, increasing the temperature increases the variability of the text and the details that are generated.
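Reusing the model, tokenizer and inputs from the earlier generation sketch, a short loop over temperature values (settings assumed) would produce comparable side-by-side outputs:

# Reusing model, tokenizer and inputs from the earlier sketch
for temperature in (0.1, 0.75, 1.25):
    output_ids = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,  # only the temperature changes between runs
        max_new_tokens=150,
    )
    print(f"--- temperature={temperature} ---")
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))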

Footnotes

[1] Max Peeperkorn, Tom Kouwenhoven, Dan Brown, and Anna Jordanous, Is Temperature the Creativity Parameter of Large Language Models?, 2024, https://arxiv.org/abs/2405.00492 (link resides outside ibm.com)
